Overview

The MVAPICH2 software, which supports the MPI 3.0 standard, is designed to deliver high performance, scalability, and fault tolerance for high-end computing systems and servers using InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE networking technologies. More than 3,125 organizations in 89 countries use it to extract the potential of these networking technologies on modern systems.

Description

MVAPICH2 provides many features, including MPI-3 standard compliance, single-copy intra-node communication using Linux-supported CMA (Cross Memory Attach), Checkpoint/Restart using LLNL's Scalable Checkpoint/Restart Library (SCR), high-performance and scalable InfiniBand hardware multicast-based collectives, enhanced shared-memory-aware and intra-node zero-copy collectives (using LiMIC), high-performance communication support for NVIDIA GPUs with IPC, collective and non-contiguous datatype support, integrated hybrid UD-RC/XRC design, support for UD-only mode, nemesis-based interface, shared memory interface, scalable and robust daemon-less job startup (mpirun-rsh), flexible process manager support (mpirun-rsh and hydra.mpiexec), full autoconf-based configuration, portable hardware locality (hwloc) with flexible CPU granularity policies (core, socket, and numanode) and binding policies (bunch and scatter) with SMT support, flexible rail binding with processes for multirail configurations, message coalescing, dynamic process migration, fast process-level fault tolerance with checkpoint-restart, a fast job-pause-migration-resume framework for proactive fault tolerance, suspend/resume, network-level fault tolerance with Automatic Path Migration (APM), RDMA CM support, iWARP support, optimized collectives, on-demand connection management, multi-pathing, RDMA-read-based and RDMA-write-based designs, polling and blocking-based communication progress, multi-core optimized and scalable shared memory support, LiMIC2-based kernel-level shared memory support for both two-sided and one-sided operations, shared-memory-backed windows for one-sided communication, HugePage support, and memory hook with ptmalloc2 library support.
The ADI-3-level design of MVAPICH2 2.1rc1 supports many features, including MPI-2 functionalities (one-sided, dynamic process management, collectives, and datatype), multi-threading, and all MPI-1 functionalities. It also supports a wide range of platforms, architectures, operating systems, compilers, InfiniBand adapters (Mellanox and QLogic), iWARP adapters (including the new Chelsio T4 adapter), and RoCE adapters.

Software Distribution

Link to High Performance MPI on Infiniband Cluster

Journals (13)
1	K. Suresh, K. Khorassani, C. Chen, B. Ramesh, M. Abduljabbar, A. Shafi, H. Subramoni, and DK Panda, Network Assisted Non-Contiguous Transfers for GPU-Aware MPI Libraries, IEEE Micro, Jan 2023.
2	S. Ramesh, A. Mahéo, S. Shende, A. Malony, H. Subramoni, A. Ruhela, and DK Panda, MPI performance engineering with the MPI tool interface: The integration of MVAPICH and TAU, ISSN 0167-8191, Volume 77, Sep 2018.
3	S. Sur, S. Potluri, K. Kandalla, H. Subramoni, K. Tomko, and DK Panda, Co-Designing MPI Library and Applications for InfiniBand Clusters, IEEE Computer, Nov 2011.
4	A. Vishnu, M. Koop, A. Moody, A. Mamidala, S. Narravula, and DK Panda, Topology Agnostic Hot-Spot Avoidance with InfiniBand, Concurrency and Computation: Practice and Experience, Special Issue of Best Papers from CCGrid '07, Jan 2008.
5	H. Wang, S. Potluri, D. Bureddy, and DK Panda, GPU-Aware MPI on RDMA-Enabled Cluster: Design, Implementation and Evaluation, IEEE Transactions on Parallel & Distributed Systems, Vol. 25, No. 10, pp. 2595-2605, Oct 2014.
6	DK Panda, H. Subramoni, C. Chu, and M. Bayatpour, The MVAPICH project: Transforming Research into High-Performance MPI Library for HPC Community, Journal of Computational Science (JOCS), Special Issue on Translational Computer Science, Oct 2020.
7	C. Chu, X. Lu, Ammar Awan, H. Subramoni, Bracy Elton, and DK Panda, Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast, IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 30, no. 3, pp. 575-588, 1 March 2019.
8	Ammar Awan, K. Vadambacheri Manian, C. Chu, H. Subramoni, and DK Panda, Optimized Large-Message Broadcast for Deep Learning Workloads: MPI, MPI+NCCL, or NCCL2?, Volume 85, July 2019, Pages 141-152, https://doi.org/10.1016/j.parco.2019.03.005
9	S. Chakraborty, Ignacio Laguna, Murali Emani, Kathryn Mohror, DK Panda, Martin Schulz, and H. Subramoni, EReinit: Scalable and Efficient Fault Tolerance for Bulk-Synchronous MPI Applications, Concurrency and Computation: Practice and Experience, 14 August 2018, https://doi.org/10.1002/cpe.4863
10	T. Tran, B. Ramesh, B. Michalowicz, M. Abduljabbar, H. Subramoni, A. Shafi, and DK Panda, Accelerating Communication with Multi-HCA Aware Collectives in MPI, Concurrency and Computation: Practice and Experience (CCPE), July 2023.
11	A. Ruhela, H. Subramoni, S. Chakraborty, M. Bayatpour, P. Kousha, and DK Panda, Efficient Design for MPI Asynchronous Progress without Dedicated Resources, Parallel Computing - Systems & Applications, Volume 85, July 2019, Pages 13-26, https://doi.org/10.1016/j.parco.2019.03.003
12	Ammar Awan, A. Jain, C. Chu, H. Subramoni, and DK Panda, Communication Profiling and Characterization of Deep Learning Workloads on Clusters with High-Performance Interconnects, IEEE Micro, vol. 40, no. 1, pp. 35-43, 1 Jan.-Feb. 2020.
13	J. Hashmi, C. Chu, S. Chakraborty, M. Bayatpour, H. Subramoni, and DK Panda, FALCON-X: Zero-copy MPI Derived Datatype Processing on Modern CPU and GPU Architectures, Journal of Parallel and Distributed Computing (JPDC), Volume 144, October 2020, Pages 1-13, doi.org/10.1016/j.jpdc.2020.05.008

Conferences & Workshops (335)
1	OMB-CXL: A Micro-Benchmark Suite for Evaluating MPI Communication Utilizing Compute Express Link Memory Devices. T. Tran, M. Abduljabbar, H. Ahn, Seonyoung Kim, Yoomi Park, Woojong Han, H. Ahn, H. Subramoni, and DK Panda, PRACTICE & EXPERIENCE IN ADVANCED RESEARCH COMPUTING, Jul 2024 [July 21st to July 25th, 2024 in Providence, RI.]
2	HINT: Designing Cache-Efficient MPI_Alltoall using Hybrid Memory Copy Ordering and Non-Temporal Instructions. B. Ramesh, N. Contini, N. Alnaasan, K. Suresh, M. Abduljabbar, A. Shafi, H. Subramoni, and DK Panda, 38th IEEE International Parallel & Distributed Processing Symposium, May 2024
3	Accelerating MPI AllReduce Communication with Efficient GPU-Based Compression Schemes on Modern GPU Clusters. Q. Zhou, B. Ramesh, A. Shafi, M. Abduljabbar, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2024, May 2024
4	Profiling, Storing and Monitoring HPC Communication Data at Scale by OSU INAM. P. Kousha, H. Subramoni, DK Panda, M. Tatineni, and P. Mulrooney, ISC HIGH PERFORMANCE 2024, May 2024 [Research Poster]
5	Optimized All-to-all Connection Establishment for High-Performance MPI Libraries over InfiniBand. S. Xu, G. Kuncham, M. Abduljabbar, H. Subramoni, and DK Panda, 30th IEEE INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING, DATA, & ANALYTICS (HiPC'23), Dec 2023
6	Designing In-network Computing Aware Reduction Collectives in MPI. B. Ramesh, G. Kuncham, K. Suresh, R. Vaidya, N. Alnaasan, M. Abduljabbar, A. Shafi, and DK Panda, Hot Interconnects 2023, Aug 2023
7	Battle of the BlueFields: An In-Depth Comparison of the BlueField-2 and BlueField-3 SmartNICs. B. Michalowicz, K. Suresh, H. Subramoni, DK Panda, and S. Poole, Hot Interconnects 2023, Aug 2023
8	DPU-Bench: A Micro-Benchmark Suite to Measure Offload Efficiency Of SmartNICs. B. Michalowicz, K. Suresh, H. Subramoni, DK Panda, and S. Poole, Practice and Experience in Advanced Research Computing 23, Jul 2023
9	Enabling Reconfigurable HPC through MPI-based Inter-FPGA Communication. N. Contini, B. Ramesh, K. Suresh, T. Tran, B. Michalowicz, M. Abduljabbar, H. Subramoni, and DK Panda, International Conference on Supercomputing 2023, Jun 2023
10	A Novel Framework for Efficient Offloading of Communication Operations to Bluefield SmartNICs. K. Suresh, B. Michalowicz, B. Ramesh, N. Contini, J. Yao, S. Xu, A. Shafi, H. Subramoni, and DK Panda, 37th IEEE International Parallel & Distributed Processing Symposium (IPDPS '23), May 2023
11	Accelerating Distributed Deep Learning Training with Compression Assisted Allgather and Reduce-Scatter Communication. Q. Zhou, Q. Anthony, L. Xu, A. Shafi, M. Abduljabbar, H. Subramoni, and DK Panda, 37th IEEE International Parallel & Distributed Processing Symposium (IPDPS '23), May 2023
12	Designing and Optimizing GPU-aware Nonblocking MPI Neighborhood Collective Communication for PETSc. K. Khorassani, C. Chen, H. Subramoni, and DK Panda, 37th IEEE International Parallel & Distributed Processing Symposium (IPDPS '23), May 2023
13	In-Depth Evaluation of a Lower-Level Direct-Verbs API on InfiniBand-based Clusters: Early Experiences. B. Michalowicz, K. Suresh, B. Ramesh, A. Shafi, H. Subramoni, M. Abduljabbar, and DK Panda, 25th Workshop on Advances in Parallel and Distributed Computational Models, May 2023 [Held in conjunction with IPDPS 2023]
14	Efficient Personalized and Non-Personalized Alltoall Communication for Modern Multi-HCA GPU-Based Clusters. K. Suresh, A. Paniraja Guptha, B. Michalowicz, B. Ramesh, M. Abduljabbar, A. Shafi, H. Subramoni, and DK Panda, 29th IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2022
15	Designing Efficient Pipelined Communication Schemes using Compression in MPI Libraries. B. Ramesh, Q. Zhou, A. Shafi, M. Abduljabbar, H. Subramoni, and DK Panda, 29th IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2022
16	Accelerating Broadcast Communication with GPU Compression for Deep Learning Workloads. Q. Zhou, Q. Anthony, A. Shafi, H. Subramoni, and DK Panda, 29th IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2022
17	Designing Hierarchical Multi-HCA Aware Allgather in MPI. T. Tran, B. Michalowicz, B. Ramesh, H. Subramoni, A. Shafi, and DK Panda, Fifteenth International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2), 2022, Aug 2022 [To be held in conjunction with ICPP 2022: The 51st International Conference on Parallel Processing August 29th to Sept 1st, 2022 in Bordeaux, France]
18	Network-Assisted Non-Contiguous Transfers for GPU-Aware MPI Libraries. K. Suresh, K. Khorassani, C. Chen, B. Ramesh, M. Abduljabbar, A. Shafi, and DK Panda, Hot Interconnects 29, Aug 2022
19	Arm meets Cloud: A Case Study of MPI Library Performance on AWS Arm-based HPC Cloud with Elastic Fabric Adapter. S. Xu, A. Shafi, H. Subramoni, and DK Panda, 24th Workshop on Advances in Parallel and Distributed Computational Models, May 2022
20	Towards Java-based HPC using the MVAPICH2 Library: Early Experiences. K. Al Attar, A. Shafi, H. Subramoni, and DK Panda, HIPS '22 (IPDPSW), May 2022
21	Highly Efficient Alltoall and Alltoallv Communication Algorithms for GPU Systems. C. Chen, K. Khorassani, Q. Anthony, A. Shafi, H. Subramoni, and DK Panda, Heterogeneity in Computing Workshop (HCW 2022), May 2022 [held in conjunction with IPDPS'22]
22	OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries on HPC Systems. N. Alnaasan, A. Jain, A. Shafi, H. Subramoni, and DK Panda, 23rd Parallel and Distributed Scientific and Engineering Computing Workshop (PDSEC) at IPDPS22, May 2022
23	Hy-Fi: Hybrid Five-Dimensional Parallel DNN Training on High-Performance GPU Clusters. A. Jain, A. Shafi, Q. Anthony, P. Kousha, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2022, May 2022
24	Accelerating MPI All-to-All Communication with Online Compression on Modern GPU Clusters. Q. Zhou, P. Kousha, Q. Anthony, K. Khorassani, A. Shafi, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2022, May 2022
25	Layout aware Hardware assisted Designs for Derived Data Types in MPI. K. Suresh, B. Ramesh, C. Chen, M. Ghazimirsaeed, M. Bayatpour, A. Shafi, H. Subramoni, and DK Panda, 28th IEEE International Conference on High Performance Computing, Data, Analytics, and Data Science, Dec 2021
26	Towards Architecture-aware Hierarchical Communication Trees on Modern HPC Systems. B. Ramesh, J. Hashmi, S. Xu, A. Shafi, M. Ghazimirsaeed, M. Bayatpour, H. Subramoni, and DK Panda, 28th IEEE International Conference on High Performance Computing, Data, Analytics, and Data Science, Dec 2021 [Best Paper Finalist]
27	Large-Message Nonblocking MPI_Iallgather and MPI_Ibcast Offload via BlueField-2 DPU. N. Sarkauskas, M. Bayatpour, T. Tran, B. Ramesh, H. Subramoni, and DK Panda, 28th IEEE International Conference on High Performance Computing, Data, Analytics, and Data Science, Dec 2021 [Short Paper]
28	Accelerating CPU-based Distributed DNN Training on Modern HPC Clusters using BlueField-2 DPUs. A. Jain, N. Alnaasan, A. Shafi, H. Subramoni, and DK Panda, 28th IEEE Hot Interconnects, Aug 2021
29	BluesMPI: Efficient MPI Non-blocking Alltoall Offloading Designs on Modern BlueField Smart NICs. M. Bayatpour, N. Sarkauskas, H. Subramoni, J. Hashmi, and DK Panda, ISC HIGH PERFORMANCE 2021, Jun 2021
30	Designing a ROCm-aware MPI Library for AMD GPUs: Early Experiences. K. Khorassani, J. Hashmi, C. Chu, C. Chen, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2021, Jun 2021
31	SUPER: SUb-Graph Parallelism for TransformERs. A. Jain, T. Moon, T. Benson, H. Subramoni, S. Jacobs, DK Panda, and B. Essen, 35th IEEE International Parallel & Distributed Processing Symposium, May 2021
32	Designing High-Performance MPI Libraries with On-the-fly Compression for Modern GPU Clusters. Q. Zhou, C. Chu, N. Senthil Kumar, P. Kousha, M. Ghazimirsaeed, H. Subramoni, and DK Panda, 35th IEEE International Parallel & Distributed Processing Symposium, May 2021 [Best Paper Finalist]
33	Adaptive and Hierarchical Large Message All-to-all Communication Algorithms for Large-scale Dense GPU Systems. K. Khorassani, C. Chu, Q. Anthony, H. Subramoni, and DK Panda, The 21st IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, May 2021
34	A Hierarchical and Load-Aware Design for Large Message Neighborhood Collectives. M. Ghazimirsaeed, Q. Zhou, A. Ruhela, M. Bayatpour, H. Subramoni, and DK Panda, SC 2020, Nov 2020
35	GEMS: GPU Enabled Memory Aware Model Parallelism System for Distributed DNN Training. A. Jain, Ammar Awan, A. Aljuhani, J. Hashmi, Q. Anthony, H. Subramoni, DK Panda, R. Machiraju, and A. Parwani, SC 2020, Nov 2020
36	Scalable MPI Collectives using SHARP: Large Scale Performance Evaluation on the TACC Frontera System. B. Ramesh, K. Suresh, N. Sarkauskas, M. Bayatpour, J. Hashmi, H. Subramoni, and DK Panda, ExaMPI2020 - Workshop on Exascale MPI 2020, Nov 2020
37	MPI Meets Cloud: Case Study with Amazon EC2 and Microsoft Azure. S. Xu, M. Ghazimirsaeed, J. Hashmi, H. Subramoni, and DK Panda, 4th Workshop on Emerging Parallel and Distributed Runtime Systems and Middlewares, Nov 2020
38	Exploring Hybrid MPI+Kokkos Tasks Programming Model. Samuel Khuvis, K. Tomko, J. Hashmi, and DK Panda, The 3rd Annual Parallel Applications Workshop, Alternatives to MPI+X (PAW-ATM), Nov 2020 [held in conjunction with SC’20]
39	Design and Characterization of Infiniband Hardware Tag Matching in MPI. M. Bayatpour, M. Ghazimirsaeed, S. Xu, H. Subramoni, and DK Panda, The 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, Nov 2020
40	Dynamic Kernel Fusion for Bulk Non-contiguous Data Transfer on GPU Clusters. C. Chu, K. Khorassani, Q. Zhou, H. Subramoni, and DK Panda, 22nd IEEE International Conference on Cluster Computing (IEEE Cluster 2020), Sep 2020
41	NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems. C. Chu, P. Kousha, Ammar Awan, K. Khorassani, H. Subramoni, and DK Panda, The 34th ACM International Conference on Supercomputing (ICS-2020), Jun 2020
42	Communication-Aware Hardware-Assisted MPI Overlap Engine. M. Bayatpour, J. Hashmi, S. Chakraborty, K. Suresh, M. Ghazimirsaeed, B. Ramesh, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2020, Jun 2020
43	HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training with TensorFlow. Ammar Awan, A. Jain, Q. Anthony, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2020, Jun 2020
44	Machine-agnostic and Communication-aware Designs for MPI on Emerging Architectures. J. Hashmi, S. Xu, B. Ramesh, M. Bayatpour, H. Subramoni, and DK Panda, 34th IEEE International Parallel & Distributed Processing Symposium (IPDPS '20), May 2020
45	Performance Characterization of Network Mechanisms for Non-Contiguous Data Transfers in MPI. K. Suresh, B. Ramesh, M. Ghazimirsaeed, M. Bayatpour, J. Hashmi, H. Subramoni, and DK Panda, Workshop on Scalable Networks for Advanced Computing Systems (SNACS) at IPDPS '20, May 2020
46	Analyzing and Understanding the Impact of Interconnect Performance on HPC, Big Data, and Deep Learning Applications: A Case Study with InfiniBand EDR and HDR. A. Ruhela, S. Xu, K. Vadambacheri Manian, H. Subramoni, and DK Panda, Workshop on Scalable Networks for Advanced Computing Systems (SNACS) at IPDPS '20, May 2020
47	High-Performance Adaptive MPI Derived Datatype Communication for Modern Multi-GPU Systems. C. Chu, J. Hashmi, K. Khorassani, H. Subramoni, and DK Panda, 26th IEEE International Conference on High Performance Computing, Data, Analytics and Data Science (HiPC '19), Dec 2019
48	Design and Evaluation of Shared Memory Communication Benchmarks on Emerging Architectures using MVAPICH2. S. Xu, J. Hashmi, S. Chakraborty, H. Subramoni, and DK Panda, Third Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, held in conjunction with SC '19, Nov 2019
49	OMB-UM: Design, Implementation, and Evaluation of CUDA Unified Memory Aware MPI Benchmarks. K. Vadambacheri Manian, C. Chu, Ammar Awan, K. Khorassani, H. Subramoni, and DK Panda, 10th International Workshop in Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, Nov 2019
50	Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera. A. Jain, Ammar Awan, H. Subramoni, and DK Panda, 3rd Deep Learning on Supercomputers Workshop (DLS) at SC19, Nov 2019
51	Performance Characterization of DNN Training using TensorFlow and PyTorch on Modern Clusters. A. Jain, Ammar Awan, Q. Anthony, H. Subramoni, and DK Panda, 21st IEEE International Conference on Cluster Computing, Sep 2019
52	Designing Scalable and High-performance MPI Libraries on Amazon Elastic Fabric Adapter. S. Chakraborty, S. Xu, H. Subramoni, and DK Panda, HOT Interconnects 26, Aug 2019
53	Performance Evaluation of MPI Libraries on GPU-enabled OpenPOWER Architectures: Early Experiences. K. Khorassani, C. Chu, H. Subramoni, and DK Panda, International Workshop on OpenPOWER for HPC, held in conjunction with ISC'19, Jun 2019
54	Reduction Operations on Modern Supercomputers: Challenges and Solutions. M. Bayatpour, J. Hashmi, S. Chakraborty, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2019, Jun 2019 [Best Poster Award]
55	FALCON: Efficient Designs for Zero-copy MPI Datatype Processing on Emerging Architectures. J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and DK Panda, 33rd IEEE International Parallel & Distributed Processing Symposium (IPDPS '19), May 2019 [Best Paper Finalist]
56	C-GDR: High-Performance Container-aware GPUDirect MPI Communication Schemes on RDMA Networks. J. Zhang, X. Lu, C. Chu, and DK Panda, 33rd IEEE International Parallel & Distributed Processing Symposium (IPDPS '19), May 2019
57	Design and Characterization of Shared Address Space MPI Collectives on Modern Architectures. J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and DK Panda, The 19th Annual IEEE/ACM International Symposium in Cluster, Cloud, and Grid Computing (CCGRID 2019), May 2019
58	Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation. Ammar Awan, J. Bedorf, C. Chu, H. Subramoni, and DK Panda, The 19th Annual IEEE/ACM International Symposium in Cluster, Cloud, and Grid Computing (CCGRID 2019), May 2019
59	Characterizing CUDA Unified Memory (UM)-Aware MPI Designs on Modern GPU Architectures. K. Vadambacheri Manian, Ammar Awan, A. Ruhela, C. Chu, and DK Panda, 12th Workshop on General Purpose Processing Using GPU (GPGPU 2019) @ ASPLOS 2019, Apr 2019
60	Cooperative Rendezvous Protocols for Improved Performance and Overlap. S. Chakraborty, M. Bayatpour, J. Hashmi, H. Subramoni, and DK Panda, 2018 The International Conference for High Performance Computing, Networking, Storage, and Analysis, Nov 2018 [Best Student Paper Finalist]
61	Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL? Ammar Awan, C. Chu, H. Subramoni, and DK Panda, The EuroMPI 2018 Conference, Sep 2018
62	Multi-Threading and Lock-Free MPI RMA Based Graph Processing on KNL and POWER Architectures. M. Li, X. Lu, H. Subramoni, and DK Panda, The EuroMPI 2018 Conference, Sep 2018
63	Efficient Asynchronous Communication Progress for MPI without Dedicated Resources. A. Ruhela, H. Subramoni, S. Chakraborty, M. Bayatpour, P. Kousha, and DK Panda, EuroMPI 2018, Sep 2018
64	SALaR: Scalable and Adaptive Designs for Large Message Reduction Collectives. M. Bayatpour, J. Hashmi, S. Chakraborty, H. Subramoni, P. Kousha, and DK Panda, IEEE Cluster 2018, Sep 2018 [Best Paper Award]
65	Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores. J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and DK Panda, 32nd IEEE International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018
66	Kernel-assisted Communication Engine for MPI on Emerging Manycore Processors. J. Hashmi, K. Hamidouche, H. Subramoni, and DK Panda, 24th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC'17), Dec 2017
67	Designing Registration Caching Free High-Performance MPI Library with Implicit On-Demand Paging (ODP) of InfiniBand. M. Li, X. Lu, H. Subramoni, and DK Panda, 24th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC'17), Dec 2017
68	MPI-LiFE: Designing High-Performance Linear Fascicle Evaluation of Brain Connectome with MPI. X. Lu, F. Pestilli, C.F. Caiafa, and DK Panda, 24th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC'17), Dec 2017
69	An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures. Ammar Awan, H. Subramoni, and DK Panda, 3rd Workshop on Machine Learning in High Performance Computing Environments, held in conjunction with SC17, Nov 2017
70	Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design. M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and DK Panda, SuperComputing 2017, Nov 2017
71	Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X. J. Hashmi, M. Li, H. Subramoni, and DK Panda, Intel Xeon Phi User's Group (IXPUG) 2017, Sep 2017
72	Advancing MPI Libraries to the Many-core Era: Designs and Evaluations with MVAPICH2. S. Chakraborty, M. Bayatpour, H. Subramoni, and DK Panda, Intel Xeon Phi User's Group (IXPUG) 2017, Sep 2017
73	MPI Performance Engineering with the MPI Tool Interface: the Integration of MVAPICH and TAU. DK Panda, 24th European MPI Users' Group Meeting, Sep 2017 [Best Paper]
74	Contention Aware Kernel-Assisted MPI Collectives for Multi/Many-core Systems. S. Chakraborty, H. Subramoni, and DK Panda, 2017 IEEE International Conference on Cluster Computing, Sep 2017 [Best Paper Finalist]
75	Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning. C. Chu, X. Lu, Ammar Awan, H. Subramoni, J. Hashmi, Bracy Elton, and DK Panda, ICPP 2017: International Conference on Parallel Processing, Aug 2017
76	MPI-GDS: High Performance MPI Designs with GPUDirect-aSync for CPU-GPU Control Flow Decoupling. A. Venkatesh, C. Chu, K. Hamidouche, S. Potluri, Davide Rossetti, and DK Panda, ICPP 2017: International Conference on Parallel Processing, Aug 2017
77	Designing Dynamic and Adaptive MPI Point-to-point Communication Protocols for Efficient Overlap of Computation and Communication. H. Subramoni, S. Chakraborty, and DK Panda, International Supercomputing Conference (ISC ’17), Jun 2017 [Hans Meuer Award (Most Outstanding Research Paper)]
78	S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters. Ammar Awan, K. Hamidouche, J. Hashmi, and DK Panda, 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Feb 2017
79	Mizan-RMA: Accelerating Mizan Graph Processing Framework with MPI RMA. M. Li, X. Lu, K. Hamidouche, J. Zhang, and DK Panda, 23rd IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2016
80	Designing MPI Library with On-Demand Paging (ODP) of InfiniBand: Challenges and Benefits. M. Li, K. Hamidouche, X. Lu, H. Subramoni, J. Zhang, and DK Panda, SuperComputing 2016, Nov 2016
81	Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning. Ammar Awan, K. Hamidouche, A. Venkatesh, and DK Panda, The 23rd European MPI Users' Group Meeting (EuroMPI 16), Sep 2016 [Best Paper Runner-Up]
82	Adaptive and Dynamic Design for MPI Tag Matching. M. Bayatpour, H. Subramoni, S. Chakraborty, and DK Panda, IEEE Cluster 2016, Sep 2016 [Best Paper Nominee]
83	INAM^2: InfiniBand Network Analysis & Monitoring with MPI. H. Subramoni, A. Augustine, M. Arnold, J. Perkins, X. Lu, K. Hamidouche, and DK Panda, International Supercomputing Conference, Jun 2016
84	SHMEMPMI - Shared Memory based PMI for Improved Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and DK Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid'16), May 2016
85	A Case for Application-Oblivious Energy-Efficient MPI Runtime. A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, DK Panda, D. Kerbyson, and A. Hoisie, Supercomputing 2015, Nov 2015 [Best Student Paper Finalist]
86	GPU-Aware Design, Implementation, and Evaluation of Non-blocking Collective Benchmarks. Ammar Awan, K. Hamidouche, A. Venkatesh, J. Perkins, H. Subramoni, and DK Panda, EuroMPI 2015, Sep 2015
87	High Performance MPI Datatype Support with User-mode Memory Registration: Challenges, Designs and Benefits. M. Li, H. Subramoni, K. Hamidouche, X. Lu, and DK Panda, IEEE Cluster 2015, Sep 2015
88	Impact of InfiniBand DC Transport Protocol on Energy Consumption of All-to-all Collective Algorithms. H. Subramoni, A. Venkatesh, K. Hamidouche, K. Tomko, and DK Panda, 23rd International Symposium on High Performance Interconnects 2015, Aug 2015
89	Designing Non-Blocking Personalized Collectives with Near Perfect Overlap for RDMA-Enabled Clusters. H. Subramoni, Ammar Awan, K. Hamidouche, D. Pekurovsky, A. Venkatesh, S. Chakraborty, K. Tomko, and DK Panda, ISC '15, Jul 2015
90	On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI. S. Chakraborty, H. Subramoni, J. Perkins, Ammar Awan, and DK Panda, HIPS '15 (IPDPS Workshop), May 2015
91	High-Performance Coarray Fortran Support with MVAPICH2-X: Initial Experience and Evaluation. J. Lin, K. Hamidouche, X. Lu, M. Li, and DK Panda, HIPS '15 (IPDPS Workshop), May 2015
92	Non-blocking PMI Extensions for Fast MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins, and DK Panda, CCGrid '15, May 2015
93	MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds. J. Zhang, X. Lu, M. Arnold, and DK Panda, CCGrid '15, May 2015
94	Power-Check: An Energy-Efficient Checkpointing Framework for HPC Clusters. R. Rajachandrasekar, A. Venkatesh, K. Hamidouche, and DK Panda, CCGrid '15, May 2015
95	Designing High Performance Communication Runtime for GPU Managed Memory: Early Experiences. D. Banerjee, K. Hamidouche, and DK Panda, General Purpose GPU (GPGPU-9), Mar 2015
96	High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters. J. Zhang, X. Lu, J. Jose, M. Li, R. Shi, and DK Panda, International Conference on High Performance Computing (HiPC'14), Dec 2014
97	Designing Efficient Small Message Transfer Mechanism for Inter-node MPI Communication on InfiniBand GPU Clusters. R. Shi, S. Potluri, K. Hamidouche, M. Li, J. Perkins, D. Rossetti, and DK Panda, IEEE International Conference on High Performance Computing (HiPC ’14), Dec 2014
98	A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on Infiniband Clusters. A. Venkatesh, H. Subramoni, K. Hamidouche, and DK Panda, IEEE International Conference on High Performance Computing (HiPC ’14), Dec 2014
99	Scalable MiniMD Design with Hybrid MPI and OpenSHMEM. M. Li, J. Lin, X. Lu, K. Hamidouche, K. Tomko, and DK Panda, OUG '14 (Co-located with PGAS), Oct 2014
100	Designing Scalable Out-of-core Sorting with Hybrid MPI+PGAS Programming Models. J. Jose, S. Potluri, H. Subramoni, X. Lu, K. Hamidouche, K. Schulz, H. Sundar, and DK Panda, International Conference on Partitioned Global Address Space Programming Models (PGAS '14), Oct 2014
101	PMI Extensions for Scalable MPI Startup. S. Chakraborty, H. Subramoni, J. Perkins, A. Moody, M. Arnold, and DK Panda, EuroMPI/ASIA 2014, Sep 2014
102	Understanding the Memory-Utilization of MPI Libraries: Challenges and Designs in Implementing the MPI_T Interface. R. Rajachandrasekar, J. Perkins, K. Hamidouche, M. Arnold, and DK Panda, EuroMPI/ASIA 2014, Sep 2014
103	HAND: A Hybrid Approach to Accelerate Non-contiguous Data Movement using MPI Datatypes on GPU Clusters. R. Shi, X. Lu, S. Potluri, K. Hamidouche, J. Zhang, and DK Panda, International Conference on Parallel Processing (ICPP’14), Sep 2014
104	Designing Topology-Aware Communication Schedules for Alltoall Operations in Large InfiniBand Clusters. H. Subramoni, K. Kandalla, J. Jose, K. Tomko, K. Schulz, D. Pekurovsky, and DK Panda, International Conference on Parallel Processing (ICPP’14), Sep 2014
105	High Performance OpenSHMEM for MIC Clusters: Extensions, Runtime Designs, and Application Co-Design J. Jose, K. Hamidouche, X. Lu, S. Potluri, J. Zhang, K. Tomko, and DK Panda, IEEE CLUSTER’14, Sep 2014 [Bib - Plain]
106	Scalable Graph500 Design with MPI-3 RMA M. Li, X. Lu, S. Potluri, K. Hamidouche, J. Jose, K. Tomko, and DK Panda, IEEE CLUSTER’14, Sep 2014 [Bib - Plain]
107	Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? J. Zhang, X. Lu, J. Jose, R. Shi, and DK Panda, Euro-Par 2014 Parallel Processing, Aug 2014 [Bib - Plain]
108	MIC-Check: A Distributed Checkpointing Framework for the Intel Many Integrated Cores Architecture R. Rajachandrasekar, S. Potluri, A. Venkatesh, K. Hamidouche, M. W. Rahman, and DK Panda, International Symposium on High Performance and Distributed Computing (HPDC), Jun 2014 [Bib - Plain]
109	Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty, and DK Panda, IEEE International Supercomputing Conference (ISC ’14), Jun 2014 [Bib - Plain]
110	High Performance Alltoall and Allgather designs for InfiniBand MIC Clusters A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS’14), May 2014 [Bib - Plain]
111	Optimizing Collective Communication in UPC J. Jose, K. Hamidouche, J. Zhang, A. Venkatesh, and DK Panda, International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS '14), May 2014 [Slides] [Bib - Plain]
112	A Comprehensive Performance Evaluation of OpenSHMEM Libraries on InfiniBand Clusters J. Jose, J. Zhang, A. Venkatesh, S. Potluri, and DK Panda, OpenSHMEM Workshop, Mar 2014 [Bib - Plain]
113	Initial Study of Multi-Endpoint Runtime for MPI+OpenMP Hybrid Programming Model on Multi-Core Systems M. Luo, X. Lu, K. Hamidouche, K. Kandalla, and DK Panda, International Symposium on Principles and Practice of Parallel Programming (PPoPP '14), Feb 2014 [Bib - Plain]
114	The MVAPICH Project: Evolution and Sustainability of an Open Source Production Quality MPI Library for HPC DK Panda, K. Tomko, K. Schulz, and A. Majumdar, Int'l Workshop on Sustainable Software for Science: Practice and Experiences, Nov 2013 [Bib - Plain]
115	MVAPICH-PRISM: A Proxy-based Communication Framework using InfiniBand and SCIF for Intel MIC Clusters S. Potluri, D. Bureddy, K. Hamidouche, A. Venkatesh, K. Kandalla, H. Subramoni, and DK Panda, International Conference on Supercomputing, Nov 2013 [Bib - Plain]
116	A Novel Functional Partitioning Approach to Design High-Performance MPI-3 Non-Blocking Alltoallv Collective on Multi-core Systems K. Kandalla, H. Subramoni, K. Tomko, D. Pekurovsky, and DK Panda, International Conference on Parallel Processing (ICPP '13), Oct 2013 [Bib - Plain]
117	UPC on MIC: Early Experiences with Native and Symmetric Modes M. Luo, M. Li, A. Venkatesh, X. Lu, and DK Panda, International Conference on Partitioned Global Address Space Programming Models (PGAS '13), Oct 2013 [Bib - Plain]
118	Optimizing Collective Communication in OpenSHMEM J. Jose, K. Kandalla, S. Potluri, J. Zhang, and DK Panda, International Conference on Partitioned Global Address Space Programming Models (PGAS '13), Oct 2013 [Bib - Plain]
119	Efficient Inter-node MPI Communication using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs S. Potluri, K. Hamidouche, A. Venkatesh, D. Bureddy, and DK Panda, International Conference on Parallel Processing 2013, Oct 2013 [Bib - Plain]
120	Design of Network Topology Aware Scheduling Services for Large InfiniBand Clusters H. Subramoni, D. Bureddy, K. Kandalla, K. Schulz, B. Barth, J. Perkins, M. Arnold, and DK Panda, IEEE Cluster (Cluster '13), Sep 2013 [Bib - Plain]
121	A Scalable and Portable Approach to Accelerate Hybrid HPL on Heterogeneous CPU-GPU Clusters R. Shi, S. Potluri, K. Hamidouche, X. Lu, K. Tomko, and DK Panda, IEEE Cluster (Cluster '13), Sep 2013 [Bib - Plain]
122	Efficient and Truly Passive MPI-3 RMA Using InfiniBand Atomics M. Li, S. Potluri, K. Hamidouche, J. Jose, and DK Panda, EuroMPI 2013, Sep 2013 [Slides] [Bib - Plain]
123	Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters K. Kandalla, A. Venkatesh, K. Hamidouche, S. Potluri, and DK Panda, International Symposium on High-Performance Interconnects (HotI '13), Aug 2013 [Bib - Plain]
124	MVAPICH2-MIC: A High-Performance MPI Library for Xeon Phi Clusters with InfiniBand S. Potluri, K. Hamidouche, D. Bureddy, and DK Panda, Extreme Scaling Workshop, Aug 2013 [Bib - Plain]
125	Optimized MPI Gather collective for Many Integrated Core (MIC) InfiniBand Clusters A. Venkatesh, K. Kandalla, and DK Panda, Extreme Scaling Workshop, Aug 2013 [Bib - Plain]
126	A 1PB/s File System to Checkpoint Three Million MPI Tasks R. Rajachandrasekar, A. Moody, K. Mohror, and DK Panda, International Conference on High Performance Distributed Computing (HPDC '13), Jun 2013 [Slides] [Bib - Plain]
127	Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models J. Jose, S. Potluri, K. Tomko, and DK Panda, International Supercomputing Conference (ISC '13), Jun 2013 [Slides] [Bib - Plain]
128	MIC-RO: Enabling Efficient Remote Offload on Heterogeneous Many Integrated Core (MIC) Clusters with InfiniBand K. Hamidouche, S. Potluri, H. Subramoni, K. Kandalla, and DK Panda, International Conference on Supercomputing (ICS '13), Jun 2013 [Bib - Plain]
129	Extending OpenSHMEM for GPU Computing S. Potluri, D. Bureddy, H. Wang, H. Subramoni, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS '13), May 2013 [Slides] [Bib - Plain]
130	Evaluation of Energy Characteristics of MPI Communication Primitives with RAPL A. Venkatesh, K. Kandalla, and DK Panda, International Workshop on High-Performance, Power-Aware Computing, May 2013 [Bib - Plain]
131	Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and DK Panda, International Conference on Supercomputing (SC '12), Nov 2012 [Bib - Plain]
132	Multi-Threaded UPC Runtime for GPU to GPU communication over InfiniBand M. Luo, H. Wang, and DK Panda, International Conference on Partitioned Global Address Space Programming Models (PGAS '12), Oct 2012 [Slides] [Bib - Plain]
133	Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation J. Jose, K. Kandalla, M. Luo, and DK Panda, International Conference on Parallel Processing (ICPP '12), Sep 2012 [Bib - Plain]
134	OMB-GPU: A Micro-benchmark suite for Evaluating MPI Libraries on GPU Clusters D. Bureddy, H. Wang, A. Venkatesh, S. Potluri, and DK Panda, EuroMPI 2012, Sep 2012 [Bib - Plain]
135	Minimizing Network Contention in InfiniBand Clusters with a QoS-Aware Data-Staging Framework R. Rajachandrasekar, J. Jaswani, H. Subramoni, and DK Panda, IEEE Cluster (Cluster '12), Sep 2012 [Bib - Plain]
136	Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms? K. Kandalla, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and DK Panda, Int'l Workshop on Parallel Algorithm and Parallel Software (IWPAPS12), held in conjunction with IEEE Cluster (Cluster '12), Sep 2012 [Bib - Plain]
137	A Scalable InfiniBand Network-Topology-Aware Performance Analysis Tool for MPI H. Subramoni, J. Vienne, and DK Panda, International Workshop on Productivity and Performance (Proper '12), Aug 2012 [Bib - Plain]
138	Performance Analysis and Evaluation of InfiniBand FDR and 40GigE RoCE on HPC and Cloud Computing System J. Vienne, J. Chen, M. W. Rahman, N. Islam, H. Subramoni, and DK Panda, International Symposium on High-Performance Interconnects (HotI 2012), Aug 2012 [Bib - Plain]
139	Congestion Avoidance on Manycore High Performance Computing Systems M. Luo, DK Panda, C. Iancu, and K. Z. Ibrahim, International Conference on Supercomputing (ICS '12), Jun 2012 [Bib - Plain]
140	Redesigning MPI Shared Memory Communication for Large Multi-Core Architecture M. Luo, H. Wang, J. Vienne, and DK Panda, International Supercomputing Conference 2012, Jun 2012 [Bib - Plain]
141	Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers K. Kandalla, U. Yang, J. Keasler, T. Kolev, A. Moody, H. Subramoni, K. Tomko, J. Vienne, and DK Panda, International Parallel and Distributed Processing Symposium 2012, May 2012 [Bib - Plain]
142	Designing Network Failover and Recovery in MPI for Multi-Rail InfiniBand Clusters S. P. Raikar, H. Subramoni, K. Kandalla, J. Vienne, and DK Panda, International Workshop on System Management Techniques, May 2012 [Bib - Plain]
143	Monitoring and Predicting Hardware Failures in HPC Clusters with FTB-IPMI R. Rajachandrasekar, X. Besseron, and DK Panda, International Workshop on System Management Techniques, May 2012 [Bib - Plain]
144	Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication S. Potluri, H. Wang, D. Bureddy, A. Singh, C. Rosales, and DK Panda, International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), May 2012 [Slides] [Bib - Plain]
145	Intra-MIC MPI Communication using MVAPICH2: Early Experience S. Potluri, K. Tomko, D. Bureddy, and DK Panda, TACC-Intel Highly-Parallel Computing Symposium, Apr 2012 [Slides] [Bib - Plain]
146	Multi-threaded UPC Runtime with Network Endpoints: Design Alternatives and Evaluation on Multi-core Architectures M. Luo, J. Jose, S. Sur, and DK Panda, International Conference on High Performance Computing (HiPC '11), Dec 2011 [Slides] [Bib - Plain]
147	UPC Queues for Scalable Graph Traversals: Design and Evaluation on InfiniBand Clusters J. Jose, S. Potluri, M. Luo, S. Sur, and DK Panda, Fifth Conference on Partitioned Global Address Space Programming Model (PGAS '11), Oct 2011 [Slides] [Bib - Plain]
148	Can a Decentralized Metadata Service Layer benefit Parallel Filesystems? V. Meshram, X. Besseron, X. Ouyang, R. Rajachandrasekar, and DK Panda, Workshop on Interfaces and Architectures for Scientific Data Storage (IASDS '11), held in conjunction with Cluster '11, Sep 2011 [Bib - Plain]
149	MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits A. Singh, S. Potluri, H. Wang, K. Kandalla, S. Sur, and DK Panda, International Workshop on Parallel Programming on Accelerator Clusters (PPAC '11), Sep 2011 [Slides] [Bib - Plain]
150	Design and Evaluation of Network Topology-/Speed- Aware Broadcast Algorithms for InfiniBand Clusters H. Subramoni, K. Kandalla, J. Vienne, S. Sur, B. Barth, K. Tomko, R. McLay, K. Schulz, and DK Panda, IEEE Cluster '11, Sep 2011 [Bib - Plain]
151	Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design Implementation and Evaluation with MVAPICH2 H. Wang, S. Potluri, M. Luo, A. Singh, X. Ouyang, S. Sur, and DK Panda, IEEE Cluster '11, Sep 2011 [Slides] [Bib - Plain]
152	Optimizing MPI One Sided Communication on Multi-core InfiniBand Clusters using Shared Memory Backed Windows S. Potluri, H. Wang, V. Dhanraj, S. Sur, and DK Panda, EuroMPI '11, Sep 2011 [Bib - Plain]
153	Design and Implementation of Key Proposed MPI-3 One-Sided Communication Semantics on InfiniBand S. Potluri, S. Sur, D. Bureddy, and DK Panda, EuroMPI '11, Sep 2011 [Slides] [Poster/Short Paper] [Bib - Plain]
154	CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart X. Ouyang, R. Rajachandrasekar, X. Besseron, H. Wang, J. Huang, and DK Panda, International Conference on Parallel Processing (ICPP '11), Sep 2011 [Slides] [Bib - Plain]
155	Can Checkpoint/Restart Mechanisms Benefit from Hierarchical Data Staging? R. Rajachandrasekar, X. Ouyang, X. Besseron, V. Meshram, and DK Panda, Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids 2011, held in conjunction with EuroPar, Aug 2011 [Bib - Plain]
156	INAM - A Scalable InfiniBand Network Analysis and Monitoring Tool N. Dandapanthula, H. Subramoni, J. Vienne, K. Kandalla, S. Sur, DK Panda, and R. Brightwell, 4th International Workshop on Productivity and Performance (PROPER 2011), Aug 2011 [Slides] [Bib - Plain]
157	Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL K. Kandalla, H. Subramoni, J. Vienne, K. Tomko, S. Sur, and DK Panda, Hot Interconnect '11, Aug 2011 [Bib - Plain]
158	High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT K. Kandalla, H. Subramoni, K. Tomko, D. Pekurovsky, S. Sur, and DK Panda, International Supercomputing Conference '11 (ISC'11), Jun 2011 [Bib - Plain]
159	MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters H. Wang, S. Potluri, M. Luo, A. Singh, S. Sur, and DK Panda, International Supercomputing Conference '11 (ISC'11), Jun 2011 [Slides] [Bib - Plain]
160	Efficient Intra-node Communication on Intel-MIC Clusters S. Potluri, A. Venkatesh, D. Bureddy, K. Kandalla, and DK Panda, International Symposium on Cluster, May 2011 [Slides] [Bib - Plain]
161	SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience J. Jose, M. Li, X. Lu, K. Kandalla, M. Arnold, and DK Panda, International Symposium on Cluster, May 2011 [Slides] [Bib - Plain]
162	High Performance Pipelined Process Migration with RDMA X. Ouyang, R. Rajachandrasekar, X. Besseron, and DK Panda, International Symposium on Cluster, May 2011 [Slides] [Bib - Plain]
163	Beyond Block I/O: Rethinking Traditional Storage Primitives X. Ouyang, D. Nellans, R. Wipfel, D. Flynn, and DK Panda, 17th IEEE International Symposium on High Performance Computer Architecture (HPCA-17), Feb 2011 [Slides] [Bib - Plain]
164	Scalable Earthquake Simulation on Petascale Supercomputers Y. Cui, K. B. Olsen, T. H. Jordan, K. Lee, J. Zhou, P. Small, D. Roten, G. Ely, DK Panda, A. Chourasia, J. Levesque, S. M. Day, and P. Maechling, SuperComputing 2010, Nov 2010 [Bib - Plain]
165	Unifying UPC and MPI Runtimes: Experience with MVAPICH J. Jose, M. Luo, S. Sur, and DK Panda, International Workshop on Partitioned Global Address Space (PGAS '10), Oct 2010 [Slides] [Bib - Plain]
166	RDMA-Based Job Migration Framework for MPI over InfiniBand X. Ouyang, S. Marcarelli, R. Rajachandrasekar, and DK Panda, IEEE International Conference on Cluster Computing (Cluster '10), Sep 2010 [Bib - Plain]
167	Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters H. Subramoni, P. Lai, S. Sur, and DK Panda, International Conference on Parallel Processing (ICPP '10), Sep 2010 [Slides] [Bib - Plain]
168	Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters K. Kandalla, E. Mancini, S. Sur, and DK Panda, International Conference on Parallel Processing (ICPP '10), Sep 2010 [Slides] [Bib - Plain]
169	High Performance Design and Implementation of Nemesis Communication Layer for Two-sided and One-Sided MPI Semantics in MVAPICH2 M. Luo, S. Potluri, P. Lai, E. Mancini, H. Subramoni, K. Kandalla, S. Sur, and DK Panda, International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2 '10), Sep 2010 [Bib - Plain]
170	Design and Evaluation of Generalized Collective Communication Primitives with Overlap using ConnectX-2 Offload Engine H. Subramoni, K. Kandalla, S. Sur, and DK Panda, International Symposium on High Performance Interconnects 2010, Aug 2010 [Bib - Plain]
171	Quantifying Performance Benefits of Overlap using MPI-2 in a Seismic Modeling Application S. Potluri, P. Lai, K. Tomko, S. Sur, Y. Cui, M. Tatineni, K. Schulz, W. Barth, A. Majumdar, and DK Panda, 24th International Conference on Supercomputing (ICS), Jun 2010 [Bib - Plain]
172	Designing Truly One-Sided MPI-2 RMA Intra-node Communication on Multi-core Systems P. Lai, S. Sur, and DK Panda, 24th International Conference on Supercomputing (ICS), Jun 2010 [Slides] [Bib - Plain]
173	High Performance Data Transfer in Grid Environment Using GridFTP over InfiniBand H. Subramoni, P. Lai, R. Kettimuthu, and DK Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid'10), May 2010 [Slides] [Bib - Plain]
174	Enhancing Checkpoint Performance with Staging IO and SSD X. Ouyang, S. Marcarelli, and DK Panda, IEEE International Workshop on Storage Network Architecture and Parallel I/Os (SNAPI), May 2010 [Slides] [Bib - Plain]
175	Designing Topology-Aware Collective Communication Algorithms for Large Scale InfiniBand Clusters: Case Studies with Scatter and Gather K. Kandalla, H. Subramoni, A. Vishnu, and DK Panda, International Workshop on Communication Architecture for Clusters (CAC 10), Apr 2010 [Bib - Plain]
176	Designing High-Performance and Resilient Message Passing on InfiniBand M. Koop, P. Shamis, I. Rabinovitz, and DK Panda, International Workshop on Communication Architecture for Clusters (CAC 10), Apr 2010 [Bib - Plain]
177	Designing Efficient FTP Mechanisms for High Performance Data-Transfer over InfiniBand P. Lai, H. Subramoni, S. Narravula, A. Mamidala, and DK Panda, International Conference on Parallel Processing (ICPP '09), Sep 2009 [Slides] [Bib - Plain]
178	Accelerating Checkpoint Operation by Node-Level Write Aggregation on Multicore Systems X. Ouyang, K. Gopalakrishnan, and DK Panda, International Conference on Parallel Processing (ICPP '09), Sep 2009 [Slides] [Bib - Plain]
179	CIFTS: A Coordinated Infrastructure for Fault-Tolerant Systems R. Gupta, P. Beckman, H. Park, E. Lusk, P. Hargrove, A. Geist, DK Panda, A. Lumsdaine, and J. Dongarra, International Conference on Parallel Processing (ICPP '09), Sep 2009 [Bib - Plain]
180	Designing and Evaluating MPI-2 Dynamic Process Management Support for InfiniBand T. Gangadharappa, M. Koop, and DK Panda, International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2 '09), Sep 2009 [Bib - Plain]
181	Impact of Node Level Caching in MPI Job Launch Mechanisms J. Sridhar, and DK Panda, EuroPVM/MPI '09, Sep 2009 [Slides] [Bib - Plain]
182	An Efficient Hardware-Software Approach to Network Fault Tolerance with InfiniBand A. Vishnu, M. Krishnan, and DK Panda, International Conference on Cluster Computing (Cluster '09), Sep 2009 [Slides] [Bib - Plain]
183	Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters M. Koop, M. Luo, and DK Panda, International Conference on Cluster Computing (Cluster '09), Sep 2009 [Slides] [Bib - Plain]
184	Design Alternatives for Implementing Fence Synchronization in MPI-2 One-sided Communication on InfiniBand Clusters G. Santhanaraman, T. Gangadharappa, S. Narravula, A. Mamidala, and DK Panda, International Conference on Cluster Computing (Cluster '09), Sep 2009 [Slides] [Bib - Plain]
185	RDMA over Ethernet - A Preliminary Study H. Subramoni, P. Lai, M. Luo, and DK Panda, International Workshop on High Performance Distributed Computing (HPI-DC '09), Sep 2009 [Slides] [Bib - Plain]
186	ProOnE: A General Purpose Protocol Onload Engine for Multi- and Many-Core Architectures P. Lai, P. Balaji, R. Thakur, and DK Panda, International Supercomputing Conference (ISC), Jun 2009 [Bib - Plain]
187	Designing Multi-Leader-Based Allgather Algorithms for Multi-Core Clusters K. Kandalla, H. Subramoni, G. Santhanaraman, and DK Panda, International Workshop on Communication Architecture for Clusters (CAC'09), May 2009 [Slides] [Bib - Plain]
188	Fast Checkpointing by Write Aggregation with Dynamic Buffer and Interleaving on Multicore Architecture X. Ouyang, K. Gopalakrishnan, and DK Panda, Int'l Conference on High Performance Computing 2009, Feb 2009 [Slides] [Bib - Plain]
189	ScELA: Scalable and Extensible Launching Architecture for Clusters J. Sridhar, M. Koop, J. Perkins, and DK Panda, International Symposium on High Performance Computing (HiPC), Dec 2008 [Slides] [Bib - Plain]
190	Designing High Performance pNFS With RDMA on InfiniBand R. Noronha, X. Ouyang, and DK Panda, International Symposium on High Performance Computing (HiPC), Dec 2008 [Bib - Plain]
191	Sockets Direct Protocol for Hybrid Network Stacks: A Case Study with iWARP over 10G Ethernet P. Balaji, S. Bhagvat, R. Thakur, and DK Panda, International Symposium on High Performance Computing (HiPC), Dec 2008 [Slides] [Bib - Plain]
192	Design and Evaluation of Benchmarks for Financial Applications using Advanced Message Queuing Protocol (AMQP) over InfiniBand H. Subramoni, G. Marsh, S. Narravula, P. Lai, and DK Panda, Workshop on High Performance Computational Finance (In conjunction with SC '08), Nov 2008 [OSU Technical Report Version (OSU-CISRC-10/08-TR51)] [Bib - Plain]
193	Scalable MPI Design over InfiniBand using eXtended Reliable Connection M. Koop, J. Sridhar, and DK Panda, IEEE Cluster 2008, Sep 2008 [Slides] [Bib - Plain]
194	Efficient One-Copy MPI Shared Memory Communication in Virtual Machines W. Huang, M. Koop, and DK Panda, IEEE Cluster 2008, Sep 2008 [Slides] [Bib - Plain]
195	IMCa: A High Performance Caching Frontend for GlusterFS on InfiniBand R. Noronha, and DK Panda, International Conference on Parallel Processing 2008, Sep 2008 [Slides] [Bib - Plain]
196	Performance of HPC middleware over InfiniBand WAN S. Narravula, H. Subramoni, P. Lai, R. Noronha, and DK Panda, International Conference on Parallel Processing 2008, Sep 2008 [Bib - Plain]
197	Designing An Efficient Kernel-level and User-level Hybrid Approach for MPI Intra-node Communication on Multi-core Systems L. Chai, P. Lai, H. Jin, and DK Panda, International Conference on Parallel Processing 2008, Sep 2008 [Slides] [Bib - Plain]
198	Lock-free Asynchronous Rendezvous Design for MPI Point-to-point Communication R. Kumar, A. Mamidala, M. Koop, G. Santhanaraman, and DK Panda, EuroPVM/MPI '08, Sep 2008 [OSU-CISRC-6/08-TR36] [Bib - Plain]
199	Can Software Reliability Outperform Hardware Reliability on High Performance Interconnects? A Case Study with MPI over InfiniBand M. Koop, R. Kumar, and DK Panda, 22nd ACM International Conference on Supercomputing (ICS '08), Jun 2008 [Bib - Plain]
200	Advanced RDMA-based Admission Control for Modern Data-Centers P. Lai, S. Narravula, K. Vaidyanathan, and DK Panda, CCGrid '08, May 2008 [Slides] [Bib - Plain]
201	Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications K. Vaidyanathan, and S. Narravula, CCGrid '08, May 2008 [Slides] [Bib - Plain]
202	MPI Collectives on modern Multicore clusters: Performance Optimizations and Communication Characteristics A. Mamidala, R. Kumar, D. De, and DK Panda, CCGrid '08, May 2008 [Bib - Plain]
203	Scaling Alltoall Collective on Multi-core Systems R. Kumar, A. Mamidala, and DK Panda, International Workshop on Communication Architecture for Clusters, Apr 2008 [Slides] [Bib - Plain]
204	pNFS/PVFS2 over InfiniBand: Early Experiences L. Chai, X. Ouyang, R. Noronha, and DK Panda, Petascale Data Storage Workshop, Nov 2007 [Slides] [Bib - Plain]
205	Virtual Machine Aware Communication Libraries for High Performance Computing W. Huang, M. Koop, Q. Gao, and DK Panda, SuperComputing (SC'07), Nov 2007 [Slides] [Best Student Paper Finalist] [Bib - Plain]
206	Enhancing the Performance of NFSv4 with RDMA R. Noronha, L. Chai, S. Shepler, and DK Panda, International Workshop on Storage Network Architecture and Parallel I/Os (SNAPI'07), Sep 2007 [Bib - Plain]
207	MPI-2 One Sided Usage and Implementation for Read Modify Write operations: A case study with HPCC G. Santhanaraman, S. Narravula, A. Mamidala, and DK Panda, EuroPVM/MPI 2007, Sep 2007 [Bib - Plain]
208	Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram M. Koop, S. Sur, and DK Panda, IEEE International Conference on Cluster Computing, Sep 2007 [Bib - Plain]
209	High Performance Virtual Machine Migration with RDMA over Modern Interconnects W. Huang, Q. Gao, J. Liu, and DK Panda, IEEE International Conference on Cluster Computing, Sep 2007 [Best Paper] [Bib - Plain]
210	Efficient Asynchronous Memory Copy Operations on Multi-Core Systems and I/OAT K. Vaidyanathan, L. Chai, W. Huang, and DK Panda, IEEE International Conference on Cluster Computing, Sep 2007 [Bib - Plain]
211	Group-based Coordinated Checkpointing for MPI: A Case Study on InfiniBand Q. Gao, W. Huang, M. Koop, and DK Panda, International Conference on Parallel Processing (ICPP'07), Sep 2007 [Slides] [Bib - Plain]
212	High Performance MPI over iWARP: Early Experiences S. Narravula, A. Mamidala, A. Vishnu, G. Santhanaraman, and DK Panda, Sep 2007 [Bib - Plain]
213	Designing NFS With RDMA For Security, Performance and Scalability R. Noronha, L. Chai, T. Talpey, and DK Panda, International Conference on Parallel Processing 2007, Sep 2007 [Bib - Plain]
214	Designing Next Generation Clusters: Evaluation of InfiniBand DDR/QDR on Intel Computing Platforms H. Subramoni, M. Koop, and DK Panda, International Symposium on Hot Interconnects (HotI), Aug 2007 [Slides] [Bib - Plain]
215	Performance Analysis and Evaluation of PCIe 2.0 and Quad-Data Rate InfiniBand M. Koop, W. Huang, K. Gopalakrishnan, and DK Panda, International Symposium on Hot Interconnects (HotI), Aug 2007 [Bib - Plain]
216	Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms S. Sur, M. Koop, L. Chai, and DK Panda, International Symposium on Hot Interconnects (HotI), Aug 2007 [Slides] [Bib - Plain]
217	High Performance MPI Design using Unreliable Datagram for Ultra-Scale InfiniBand Clusters M. Koop, S. Sur, Q. Gao, and DK Panda, 21st International ACM Conference on Supercomputing (ICS '07), Jun 2007 [Bib - Plain]
218	Nomad: Migrating OS-bypass Networks in Virtual Machines W. Huang, J. Liu, M. Koop, B. Abali, and DK Panda, Third International SIGPLAN/SIGOPS Conference on Virtual Execution Environments (VEE), Jun 2007 [Bib - Plain]
219	High Performance Distributed Lock Management Services using Network-based Remote Atomic Operations S. Narravula, A. Mamidala, A. Vishnu, K. Vaidyanathan, and DK Panda, International Symposium on Cluster Computing and the Grid, May 2007 [Slides] [Bib - Plain]
220	Design and Implementation of High Performance MVAPICH2: MPI2 over InfiniBand W. Huang, G. Santhanaraman, H. Jin, Q. Gao, and DK Panda, International Symposium on Cluster Computing and the Grid, May 2007 [Bib - Plain]
221	Benefits of I/O Acceleration Technology (I/OAT) in Clusters K. Vaidyanathan, and DK Panda, International Symposium on Performance Analysis of Systems and Software (ISPASS), Apr 2007 [Bib - Plain]
222	Designing Efficient Systems Services and Primitives for Next-Generation Data-Centers K. Vaidyanathan, S. Narravula, P. Balaji, and DK Panda, Workshop on NSF Next Generation Software (NGS) Program; held in conjunction with IPDPS, Apr 2007 [Bib - Plain]
223	Improving Scalability of OpenMP Applications on MultiCore Systems Using Large Page Support R. Noronha, and DK Panda, International Workshop on Multithreaded Architectures and Applications (MTAAP), Mar 2007 [Bib - Plain]
224	High Performance MPI on IBM 12x InfiniBand Architecture A. Vishnu, B. Benton, and DK Panda, International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS), Mar 2007 [Bib - Plain]
225	Automatic Path Migration over InfiniBand: Early Experience A. Vishnu, A. Mamidala, S. Narravula, and DK Panda, Third International Workshop on System Management Techniques, Mar 2007 [Bib - Plain]
226	Designing Efficient Asynchronous Memory Operations Using Hardware Copy Engine: A Case Study with I/OAT K. Vaidyanathan, W. Huang, L. Chai, and DK Panda, International Workshop on Communication Architecture for Clusters (CAC), Mar 2007 [Bib - Plain]
227	Using Connection-Oriented and Connection-Less Transport on Performance and Scalability of Collective and One-sided operations: Trade-offs and Impact A. Mamidala, S. Narravula, A. Vishnu, G. Santhanaraman, and DK Panda, International Symposium on Principles and Practice of Parallel Programming (PPoPP 2007), Mar 2007 [Bib - Plain]
228	DDSS: A Low-Overhead Distributed Data Sharing Substrate for Cluster-Based Data-Centers over Modern Interconnects K. Vaidyanathan, S. Narravula, and DK Panda, International Conference on High Performance Computing (HiPC), Dec 2006 [Slides] [Bib - Plain]
229	Finding Bugs in Large-Scale Parallel Programs by Detecting Anomaly in Data Movements Q. Gao, F. Qin, and DK Panda, SuperComputing 2006, Nov 2006 [Bib - Plain]
230	Analyzing the Impact of Supporting Out-of-Order Communication on In-order Performance with iWARP P. Balaji, W. Feng, S. Bhagvat, DK Panda, R. Thakur, and W. Gropp, SuperComputing 2006, Nov 2006 [Bib - Plain]
231	High-Performance and Scalable MPI over InfiniBand with Reduced Memory Usage: An In-Depth Performance Analysis S. Sur, M. Koop, and DK Panda, SuperComputing 2006, Nov 2006 [Bib - Plain]
232	A Software Based Approach for Providing Network Fault Tolerance in Clusters Using the uDAPL Interface: MPI Level Design and Performance Evaluation A. Vishnu, P. Gupta, A. Mamidala, and DK Panda, SuperComputing 2006, Nov 2006 [Bib - Plain]
233	NemC: A Network Emulator for Cluster-of-Clusters H. Jin, S. Narravula, K. Vaidyanathan, and DK Panda, International Conf. on Computer Commn. and Networks, Oct 2006 [Bib - Plain]
234	Designing Efficient MPI Intra-node Communication Support for Modern Computer Architectures L. Chai, A. Hartono, and DK Panda, International Conference on Cluster Computing, Sep 2006 [Bib - Plain]
235	Efficient Shared Memory and RDMA based design for MPI_Allgather over InfiniBand A. Mamidala, A. Vishnu, and DK Panda, EuroPVM/MPI, Sep 2006 [Bib - Plain]
236	Exploiting RDMA operations for Providing Efficient Fine-Grained Resource Monitoring in Cluster-based Servers K. Vaidyanathan, H. Jin, and DK Panda, Workshop on Remote Direct Memory Access (RDMA): Applications, Implementations, and Technologies, Sep 2006 [Bib - Plain]
237	Memory Scalability Evaluation of the Next-Generation Intel Bensley Platform with InfiniBand M. Koop, W. Huang, A. Vishnu, and DK Panda, International Symposium on Hot Interconnect 2006 (HotI'06), Aug 2006 [Slides] [Bib - Plain]
238	Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand Q. Gao, W. Yu, W. Huang, and DK Panda, International Conference on Parallel Processing (ICPP), Aug 2006 [Slides] [Bib - Plain]
239	High Performance Block I/O for Global File System (GFS) with InfiniBand RDMA S. Liang, W. Yu, and DK Panda, International Conference on Parallel Processing (ICPP), Aug 2006 [Bib - Plain]
240	A Case for High Performance Computing with Virtual Machines W. Huang, J. Liu, B. Abali, and DK Panda, International Conference on Supercomputing (ICS), Jun 2006 [Slides] [Bib - Plain]
241	High Performance VMM-Bypass I/O in Virtual Machines J. Liu, W. Huang, B. Abali, and DK Panda, USENIX Annual Technical Conference, Jun 2006 [Bib - Plain]
242	An MPI-Stream Hybrid Programming Model for Computational Clusters E. Mancini, G. Marsh, and DK Panda, International Symposium on Cluster Computing and the Grid (CCGrid), May 2006 [Slides] [Bib - Plain]
243	Natively Supporting True One-sided Communication in MPI on Multi-core Systems with InfiniBand G. Santhanaraman, P. Balaji, K. Gopalakrishnan, R. Thakur, W. Gropp, and DK Panda, International Symposium on Cluster Computing and the Grid (CCGrid), May 2006 [Bib - Plain]
244	Reducing Connection Memory Requirements of MPI for InfiniBand Clusters: A Message Coalescing Approach M. Koop, T. Jones, and DK Panda, International Symposium on Cluster Computing and the Grid (CCGrid), May 2006 [Bib - Plain]
245	Understanding the Impact of Multi-Core Architecture in Cluster Computing: A Case Study with Intel Dual-Core System L. Chai, Q. Gao, and DK Panda, International Symposium on Cluster Computing and the Grid (CCGrid), May 2006 [Bib - Plain]
246	Hot-Spot Avoidance With Multi-Pathing Over InfiniBand: An MPI Perspective A. Vishnu, M. Koop, A. Moody, A. Mamidala, S. Narravula, and DK Panda, International Symposium on Cluster Computing and the Grid (CCGrid), May 2006 [Bib - Plain]
247	Designing Efficient Cooperative Caching Schemes for Multi-Tier Data-Centers over RDMA-enabled Networks S. Narravula, H. Jin, K. Vaidyanathan, and DK Panda, International Symposium on Cluster Computing and the Grid (CCGrid), May 2006 [Bib - Plain]
248	MPI over uDAPL: Can High Performance and Portability Exist Across Architectures? L. Chai, R. Noronha, and DK Panda, International Symposium on Cluster Computing and the Grid 2006, May 2006 [Bib - Plain]
249	Designing High Performance and Scalable MPI Intra-node Communication Support for Clusters L. Chai, and DK Panda, International Symposium on Cluster Computing and the Grid 2006, May 2006 [Slides] [Bib - Plain]
250	Designing Next-Generation Data-Centers with Advanced Communication Protocols and Systems Services P. Balaji, K. Vaidyanathan, S. Narravula, H. Jin, and DK Panda, Workshop on NSF Next Generation Software (NGS) Program; held in conjunction with IPDPS, Apr 2006 [Slides] [Bib - Plain]
251	Shared Receive Queue based Scalable MPI Design for InfiniBand Clusters S. Sur, L. Chai, H. Jin, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS '06), Apr 2006 [Bib - Plain]
252	Adaptive Connection Management for Scalable MPI over InfiniBand W. Yu, Q. Gao, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS '06), Apr 2006 [Slides] [Bib - Plain]
253	Efficient SMP-Aware MPI-Level Broadcast over InfiniBand's Hardware Multicast A. Mamidala, L. Chai, H. Jin, and DK Panda, Communication Architecture for Clusters (CAC) Workshop, Apr 2006 [Bib - Plain]
254	Asynchronous Zero-Copy Communication for Synchronous Sockets Direct Protocol (SDP) over InfiniBand P. Balaji, S. Bhagvat, H. Jin, and DK Panda, Communication Architecture for Clusters (CAC) Workshop, Apr 2006 [Bib - Plain]
255	Benefits of High Speed Interconnects to Cluster File Systems: A Case Study with Lustre W. Yu, R. Noronha, S. Liang, and DK Panda, Communication Architecture for Clusters (CAC) Workshop, Apr 2006 [Bib - Plain]
256	RDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits S. Sur, L. Chai, H. Jin, and DK Panda, International Symposium on Principles and Practice of Parallel Programming (PPoPP 2006), Mar 2006 [Slides] [Bib - Plain]
257	A Case for UDP Offload Engines in LambdaGrids V. Vishwanath, P. Balaji, W. Feng, J. Leigh, and DK Panda, International Workshop on Protocols for Fast Long-Distance Networks (PFLDnet 2006), Feb 2006 [Bib - Plain]
258	High Performance RDMA Based All-to-all Broadcast for InfiniBand Clusters S. Sur, U. Bondhugula, A. Mamidala, H. Jin, and DK Panda, International Conference on High Performance Computing (HiPC 2005), Dec 2005 [Bib - Plain]
259	Supporting MPI-2 One Sided Communication on Multi-Rail InfiniBand Clusters: Design Challenges and Performance Benefits A. Vishnu, G. Santhanaraman, W. Huang, H. Jin, and DK Panda, International Conference on High Performance Computing (HiPC 2005), Dec 2005 [Bib - Plain]
260	Supporting iWARP Compatibility and Features for Regular Network Adapters P. Balaji, H. Jin, K. Vaidyanathan, and DK Panda, Workshop on Remote Direct Memory Access (RDMA): Applications, Implementations, and Technologies, Sep 2005 [Slides] [Bib - Plain]
261	Head-to-TOE Evaluation of High-Performance Sockets over Protocol Offload Engines P. Balaji, W. Feng, Q. Gao, R. Noronha, W. Yu, and DK Panda, IEEE Cluster Computing 2005, Sep 2005 [Slides] [Bib - Plain]
262	Swapping to Remote Memory over InfiniBand: An Approach using a High Performance Network Block Device S. Liang, R. Noronha, and DK Panda, IEEE Cluster Computing 2005, Sep 2005 [Slides] [Bib - Plain]
263	Benefits of Quadrics Scatter/Gather to PVFS2 Noncontiguous I/O W. Yu, and DK Panda, International Workshop on Storage Network Architecture and Parallel I/Os (SNAPI) 2005, Sep 2005 [Slides] [Bib - Plain]
264	Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems? S. Sur, A. Vishnu, H. Jin, W. Huang, and DK Panda, Hot Interconnects 13 (HOTI 05), Aug 2005 [Slides] [Bib - Plain]
265	Performance Characterization of a 10-Gigabit Ethernet TOE W. Feng, P. Balaji, C. Baron, L. N. Bhuyan, and DK Panda, Hot Interconnects 13 (HOTI 05), Aug 2005 [Slides] [Bib - Plain]
266	Performance Evaluation of MM5 on Clusters With Modern Interconnects: Scalability and Impact R. Noronha, and DK Panda, Euro-Par, Aug 2005 [Bib - Plain]
267	Performance Evaluation of RDMA over IP: A Case Study with the Ammasso Gigabit Ethernet NIC H. Jin, S. Narravula, K. Vaidyanathan, P. Balaji, and DK Panda, Workshop on High Performance Interconnects for Distributed Computing (HPI-DC); In conjunction with HPDC-14, Jul 2005 [Bib - Plain]
268	High Performance Support of Parallel Virtual File System (PVFS2) over Quadrics W. Yu, S. Liang, and DK Panda, International Conference on Supercomputing (ICS '05), Jun 2005 [Bib - Plain]
269	LiMIC: Support for High-Performance MPI Intra-Node Communication on Linux Cluster H. Jin, S. Sur, L. Chai, and DK Panda, International Conference on Parallel Processing (ICPP-05), Jun 2005 [Slides] [Bib - Plain]
270	Architecture for Caching Responses with Multiple Dynamic Dependencies in Multi-Tier Data-Centers over InfiniBand S. Narravula, P. Balaji, K. Vaidyanathan, H. Jin, and DK Panda, IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 05), May 2005 [Slides] [Bib - Plain]
271	Can High Performance Software DSM Systems Designed With InfiniBand Features Benefit from PCI-Express? R. Noronha, and DK Panda, DSM Workshop, May 2005 [Bib - Plain]
272	Designing Multi-Level, Multi-Tier Data Center Architecture for Securing Distributed Infrastructure and Assets DK Panda, DHS Homeland Security Conference, Apr 2005 [Bib - Plain]
273	Analysis of Design Considerations for Optimizing Multi-Channel MPI over InfiniBand L. Chai, S. Sur, H. Jin, and DK Panda, Workshop on Communication Architecture on Clusters (CAC '05), Apr 2005 [Bib - Plain]
274	Scheduling of MPI-2 One Sided Operations over InfiniBand W. Huang, G. Santhanaraman, H. Jin, and DK Panda, Workshop on Communication Architecture on Clusters (CAC '05), Apr 2005 [Slides] [Bib - Plain]
275	Performance Modeling of Subnet Management on Fat Tree InfiniBand Networks using OpenSM A. Vishnu, A. Mamidala, and H.- W, Workshop on System Management Tools on Large Scale Parallel Systems, Apr 2005 [Bib - Plain]
276	Design and Implementation of Open MPI over Quadrics/Elan4 W. Yu, T. S. Woodall, R. L. Graham, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS 2005), Apr 2005 [Slides] [Bib - Plain]
277	On the Provision of Prioritization and Soft QoS in Dynamically Reconfigurable Shared Data-Centers over InfiniBand P. Balaji, S. Narravula, K. Vaidyanathan, H. Jin, and DK Panda, IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 05), Mar 2005 [Slides] [Bib - Plain]
278	Workload-driven Analysis of File Systems in Shared Multi-Tier Data-Centers over InfiniBand K. Vaidyanathan, P. Balaji, H. Jin, and DK Panda, Computer Architecture Evaluation using Commercial Workloads (in conjunction with HPCA), Feb 2005 [Slides] [Bib - Plain]
279	Scalable Startup of Parallel Programs over InfiniBand W. Yu, J. Wu, and DK Panda, International Conference on High Performance Computing (HiPC '04), Dec 2004 [Slides] [Bib - Plain]
280	Building Multirail InfiniBand Clusters: MPI-Level Design and Performance Evaluation J. Liu, A. Vishnu, and DK Panda, SuperComputing 2004 Conference (SC 04), Nov 2004 [Slides] [Bib - Plain]
281	Reducing Diff Overhead in Software DSM Systems using RDMA Operations in InfiniBand R. Noronha, and DK Panda, Workshop on Remote Direct Memory Access (RDMA): Applications, Implementations, and Technologies in conjunction with the IEEE Cluster, Sep 2004 [Slides] [Bib - Plain]
282	Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand P. Balaji, K. Vaidyanathan, S. Narravula, K. Savitha, H. Jin, and DK Panda, Workshop on Remote Direct Memory Access (RDMA): Applications, Implementations, and Technologies in conjunction with the IEEE Cluster, Sep 2004 [Slides] [Bib - Plain]
283	Sockets vs RDMA Interface over 10-Gigabit Networks: An In-depth analysis of the Memory Traffic Bottleneck P. Balaji, H. V. Shah, and DK Panda, Workshop on Remote Direct Memory Access (RDMA): Applications, Implementations, and Technologies in conjunction with the IEEE Cluster, Sep 2004 [Slides] [Bib - Plain]
284	Scalable and High Performance NIC-Based Allgather over Myrinet/GM W. Yu, D. Buntinas, and DK Panda, International Conference on Cluster Computing 2004, Sep 2004 [Slides] [Bib - Plain]
285	Efficient Barrier and Allreduce on IBA Clusters using Hardware Multicast and Adaptive Algorithms A. Mamidala, J. Liu, and DK Panda, International Conference on Cluster Computing 2004, Sep 2004 [Bib - Plain]
286	NIC-Based Offload of Dynamic User-Defined Modules for Myrinet Clusters A. Wagner, H. Jin, R. Riesen, and DK Panda, International Conference on Cluster Computing 2004, Sep 2004 [Bib - Plain]
287	Zero-Copy MPI Derived Datatype Communication over InfiniBand G. Santhanaraman, J. Wu, and DK Panda, EuroPVM/MPI 2004, Sep 2004 [Slides] [Bib - Plain]
288	Efficient Implementation of MPI-2 Passive One-Sided Communication on InfiniBand Clusters W. Jiang, J. Liu, H. Jin, DK Panda, D. Buntinas, R. Thakur, and W. Gropp, EuroPVM/MPI 2004, Sep 2004 [Slides] [Bib - Plain]
289	Performance Evaluation of InfiniBand with PCI Express J. Liu, A. Mamidala, A. Vishnu, and DK Panda, Hot Interconnects 12 (HOTI 04), Aug 2004 [Bib - Plain]
290	Efficient and Scalable All-to-All Personalized Exchange for InfiniBand-based Clusters S. Sur, H. Jin, and DK Panda, International Conference on Parallel Processing (ICPP '04), Aug 2004 [Bib - Plain]
291	Design and Implementation of MPICH2 over InfiniBand with RDMA Support J. Liu, W. Jiang, P. Wyckoff, DK Panda, D. Ashton, D. Buntinas, W. Gropp, and B. Toonen, International Parallel and Distributed Processing Symposium (IPDPS 04), Apr 2004 [Slides] [Bib - Plain]
292	Fast and Scalable MPI-Level Broadcast using InfiniBand's Hardware Multicast Support J. Liu, A. Mamidala, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS 04), Apr 2004 [Slides] [Bib - Plain]
293	High Performance Implementation of MPI Datatype Communication over InfiniBand J. Wu, P. Wyckoff, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS 04), Apr 2004 [Bib - Plain]
294	Host-Assisted Zero-Copy Remote Memory Access Communication on InfiniBand V. Tipparaju, G. Santhanaraman, J. Nieplocha, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS 04), Apr 2004 [Bib - Plain]
295	Implementing Efficient and Scalable Flow Control Schemes in MPI over InfiniBand J. Liu, and DK Panda, International Workshop on Communication Architecture for Clusters (CAC 04), Apr 2004 [Slides] [Bib - Plain]
296	Efficient and Scalable Barrier over Quadrics and Myrinet with a New NIC-Based Collective Message Passing Protocol W. Yu, and DK Panda, International Workshop on Communication Architecture for Clusters (CAC 04), Apr 2004 [Slides] [Bib - Plain]
297	High Performance MPI-2 One-Sided Communication over InfiniBand W. Jiang, J. Liu, H. Jin, DK Panda, W. Gropp, and R. Thakur, International Symposium on Cluster Computing and the Grid (CCGrid 04), Apr 2004 [Slides] [Bib - Plain]
298	Unifier: Unifying Cache Management and Communication Buffer Management for PVFS over InfiniBand J. Wu, P. Wyckoff, DK Panda, and R. Ross, International Symposium on Cluster Computing and the Grid (CCGrid 04), Apr 2004 [Bib - Plain]
299	Designing High Performance DSM Systems using InfiniBand Features R. Noronha, and DK Panda, International Workshop on Distributed Shared Memory Systems, Apr 2004 [Slides] [Bib - Plain]
300	Sockets Direct Protocol over InfiniBand in Clusters: Is it Beneficial? P. Balaji, S. Narravula, K. Vaidyanathan, S. Krishnamoorthy, J. Wu, and DK Panda, International Symposium on Performance Analysis of Systems and Software (ISPASS 04), Apr 2004 [Bib - Plain]
301	Sockets Direct Protocol over InfiniBand in Clusters: Is it Beneficial? P. Balaji, S. Narravula, K. Vaidyanathan, S. Krishnamoorthy, J. Wu, and DK Panda, IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 04), Apr 2004 [Slides] [Bib - Plain]
302	Evaluating the Impact of RDMA on Storage I/O over InfiniBand J. Liu, DK Panda, and M. Banikazemi, SAN-03 Workshop (in conjunction with HPCA), Feb 2004 [Slides] [Bib - Plain]
303	Application-Bypass Reduction for Large-Scale Clusters A. Wagner, D. Buntinas, R. Brightwell, and DK Panda, Cluster 2003 Conference, Dec 2003 [Bib - Plain]
304	Supporting Efficient Noncontiguous Access in PVFS over InfiniBand J. Wu, P. Wyckoff, and DK Panda, Cluster 2003 Conference, Dec 2003 [Bib - Plain]
305	Optimizing Mechanisms for Latency Tolerance in Remote Memory Access Communication V. Tipparaju, M. Krishnan, J. Nieplocha, G. Santhanaraman, and DK Panda, Cluster 2003 Conference, Dec 2003 [Bib - Plain]
306	Performance Comparison of MPI Implementations over InfiniBand, Myrinet and Quadrics J. Liu, B. Chandrasekaran, J. Wu, W. Jiang, S. Kini, W. Yu, D. Buntinas, P. Wyckoff, and DK Panda, SuperComputing 2003, Nov 2003 [Bib - Plain]
307	Scalable NIC-based Reduction on Large-scale Clusters A. Moody, J. Fernandez, F. Petrini, and DK Panda, SuperComputing (SC) Conference, Nov 2003 [Bib - Plain]
308	High Performance Broadcast Support in LA-MPI over Quadrics W. Yu, S. Sur, DK Panda, R. T. Aulwes, and R. Graham, Los Alamos Computer Science Institute (LACSI) Symposium, Oct 2003 [Slides] [Bib - Plain]
309	High Performance and Reliable NIC-Based Multicast over Myrinet/GM-2 W. Yu, D. Buntinas, and DK Panda, International Conference on Parallel Processing, Oct 2003 [Slides] [Bib - Plain]
310	PVFS over InfiniBand: Design and Performance Evaluation J. Wu, P. Wyckoff, and DK Panda, International Conference on Parallel Processing, Oct 2003 [Bib - Plain]
311	Designing a Portable MPI-2 over Modern Interconnects using uDAPL Interface L. Chai, R. Noronha, P. Gupta, G. Brown, and DK Panda, Euro PVM/MPI Conference, Sep 2003 [Bib - Plain]
312	Efficient Hardware Multicast Group Management for Multiple MPI Communicators over InfiniBand A. Mamidala, H. Jin, and DK Panda, Euro PVM/MPI Conference, Sep 2003 [Slides] [Bib - Plain]
313	Design Alternatives and Performance Trade-offs for Implementing MPI-2 over InfiniBand W. Huang, G. Santhanaraman, H. Jin, and DK Panda, Euro PVM/MPI Conference, Sep 2003 [Slides] [Bib - Plain]
314	Fast and Scalable Barrier using RDMA and Multicast Mechanisms for InfiniBand-Based Clusters S. Kini, J. Liu, J. Wu, P. Wyckoff, and DK Panda, Euro PVM/MPI Conference, Sep 2003 [Bib - Plain]
315	Demotion-Based Exclusive Caching through Demote Buffering: Design and Evaluations over Different Networks J. Wu, P. Wyckoff, and DK Panda, Workshop on Storage Network Architecture and Parallel I/O (SNAPI), Sep 2003 [Bib - Plain]
316	MIBA: A Micro-benchmark Suite for Evaluating InfiniBand Architecture Implementations B. Chandrasekaran, P. Wyckoff, and DK Panda, Performance TOOLS 2003, Sep 2003 [Bib - Plain]
317	Micro-Benchmark Level Performance Comparison of High-Speed Cluster Interconnects J. Liu, B. Chandrasekaran, W. Yu, J. Wu, D. Buntinas, S. P. Kini, P. Wyckoff, and DK Panda, Hot Interconnects 10, Aug 2003 [Bib - Plain]
318	High Performance RDMA-Based MPI Implementation over InfiniBand J. Liu, J. Wu, S. Kini, P. Wyckoff, and DK Panda, International Conference on Supercomputing (ICS '03), Jun 2003 [Bib - Plain]
319	QoS-aware Middleware for Cluster-based Servers to Support Interactive and Resource-Adaptive Applications S. Senapathi, B. Chandrasekharan, D. Stredney, H.-W. Shen, and DK Panda, High Performance Distributed Computing, Jun 2003 [Bib - Plain]
320	Impact of High Performance Sockets on Data Intensive Applications P. Balaji, J. Wu, T. Kurc, U. Catalyurek, DK Panda, and J. Saltz, High Performance Distributed Computing, Jun 2003 [Bib - Plain]
321	Application-Bypass Broadcast in MPICH over GM D. Buntinas, DK Panda, and R. Brightwell, Cluster Computing and Grid (CCGrid '03), May 2003 [Bib - Plain]
322	Optimizing Barrier and Lock Operations in ARMCI D. Buntinas, A. Saify, DK Panda, and J. Nieplocha, International Workshop on Communication Architecture for Clusters (CAC '03), Apr 2003 [Bib - Plain]
323	Efficient Collective Operations using Remote Memory Operations on VIA-Based Clusters R. Gupta, P. Balaji, DK Panda, and J. Nieplocha, International Parallel and Distributed Processing Symposium (IPDPS '03), Apr 2003 [Bib - Plain]
324	NIC-Based Reduction in Myrinet Clusters: Is It Beneficial? D. Buntinas, and DK Panda, SAN-02 Workshop (in conjunction with HPCA), Apr 2003 [Bib - Plain]
325	A Portable Client/Server Communication Middleware over SANs: Design and Performance Evaluation with InfiniBand J. Liu, M. Banikazemi, B. Abali, and DK Panda, SAN-02 Workshop (in conjunction with HPCA), Apr 2003 [Bib - Plain]
326	Supporting Strong Coherency for Active Caches in Multi-Tier Data-Centers over InfiniBand S. Narravula, P. Balaji, K. Vaidyanathan, S. Krishnamoorthy, J. Wu, and DK Panda, SAN-03 Workshop (in conjunction with HPCA), Feb 2003 [Slides] [Bib - Plain]
327	Impact of On-Demand Connection Management in MPI over VIA J. Wu, J. Liu, P. Wyckoff, and DK Panda, Cluster '02, Sep 2002 [Bib - Plain]
328	Efficient Barrier using Remote Memory Operations on VIA-Based Clusters R. Gupta, V. Tipparaju, J. Nieplocha, and DK Panda, Cluster '02, Sep 2002 [Bib - Plain]
329	High Performance User-Level Sockets over Gigabit Ethernet P. Balaji, P. Shivam, P. Wyckoff, and DK Panda, Cluster '02, Sep 2002 [Bib - Plain]
330	A QoS Framework for Clusters to support Applications with Resource Adaptivity and Predictable Performance S. Senapathi, DK Panda, D. Stredney, and H.-W. Shen, International Workshop on Quality of Service (IWQoS), May 2002 [Bib - Plain]
331	Can User Level Protocols Take Advantage of Multi-CPU NICs? P. Shivam, P. Wyckoff, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS '02), Apr 2002 [Bib - Plain]
332	MPI/IO on DAFS Over VIA: Implementation and Performance Evaluation J. Wu, and DK Panda, Communication Architecture for Clusters (CAC'02) Workshop, Apr 2002 [Bib - Plain]
333	Protocols and Strategies for Optimizing Remote Memory Operations on Clusters J. Nieplocha, V. Tipparaju, A. Saify, and DK Panda, Communication Architecture for Clusters (CAC'02) Workshop, held in conjunction with IPDPS '02, Apr 2002 [Bib - Plain]
334	NIC-Based Atomic Operations on Myrinet/GM D. Buntinas, DK Panda, and W. Gropp, SAN-1 Workshop, Feb 2002 [Bib - Plain]
335	EMP: Zero-copy OS-bypass NIC-driven Gigabit Ethernet Message Passing P. Shivam, P. Wyckoff, and DK Panda, Supercomputing '01, Feb 2002 [Bib - Plain]

Ph.D. Dissertations (8)
1	M. Bayatpour, Designing High Performance Hardware-assisted Communication Middlewares for Next-Generation HPC Systems, May 2021
2	C. Chu, Accelerator-enabled Communication Middleware for Large-scale Heterogeneous HPC Systems with Modern Interconnects, Jul 2020
3	J. Hashmi, Designing High Performance Shared-Address-Space and Adaptive Communication Middlewares for Next-Generation HPC Systems, Apr 2020
4	Ammar Awan, Co-designing Communication Middleware and Deep Learning Frameworks for High-Performance DNN Training on HPC Systems, Apr 2020
5	S. Chakraborty, High Performance and Scalable Cooperative Communication Middleware for Next Generation Architectures, Jun 2019
6	J. Zhang, Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters, Jul 2018
7	M. Li, Designing High-Performance Remote Memory Access for MPI and PGAS Models with Modern Networking Technologies on Heterogeneous Clusters, Nov 2017
8	S. Potluri, Enabling Efficient Use of MPI and PGAS Programming Models on Heterogeneous Clusters with High Performance Interconnects, May 2014

M.S. Theses (8)
1	S. Srivastava, MVAPICH2-AutoTune: An Automatic Collective Tuning Framework for the MVAPICH2 MPI Library, May 2021
2	N. Senthil Kumar, Designing Optimized MPI+NCCL Hybrid Collective Communication Routines for Dense Many-GPU Clusters, May 2021
3	Kamal Raj Sankarapandian, Profiling MPI Primitives in Real-time Using OSU INAM, Apr 2020
4	R. Biswas, Benchmarking and Accelerating TensorFlow-based Deep Learning on Modern HPC Systems, Jul 2018
5	A. Augustine, Designing a Scalable Network Analysis and Monitoring Tool with MPI Support, Aug 2016
6	V. Dhanraj, Enhancement of LIMIC-Based Collectives for Multi-core Clusters, Aug 2012
7	A. Singh, Optimizing All-to-all and Allgather Communications on GPGPU Clusters, Apr 2012
8	K. Gopalakrishnan, Enhancing Fault Tolerance in MPI for Modern InfiniBand Clusters, Aug 2009
