NOWLAB :: Publications

Journals (17)
1	Q. Anthony, B. Michalowicz, J. Hatef, L. Xu, M. Abduljabbar, A. Shafi, H. Subramoni, and DK Panda, Understanding and Characterizing Communication Characteristics for Distributed Transformer Models, IEEE Micro, Jan 2025.
2	T. Tran, G. Kuncham, B. Ramesh, S. Xu, H. Subramoni, and DK Panda, OHIO: Enhancing RDMA Scalability in Alltoall with Optimized Communication Overlap, IEEE Micro, Jan 2025.
3	T. Tran, B. Ramesh, B. Michalowicz, M. Abduljabbar, H. Subramoni, A. Shafi, and DK Panda, Accelerating Communication with Multi-HCA Aware Collectives in MPI, Concurrency and Computation: Practice and Experience (CCPE), July 2023.
4	K. Suresh, K. Khorassani, C. Chen, B. Ramesh, M. Abduljabbar, A. Shafi, H. Subramoni, and DK Panda, Network Assisted Non-Contiguous Transfers for GPU-Aware MPI Libraries, IEEE Micro, Jan 2023.
5	K. Khorassani, C. Chen, B. Ramesh, A. Shafi, H. Subramoni, and DK Panda, High Performance MPI over the Slingshot Interconnect, Special Issue of Journal of Computer Science and Technology (JCST), Feb 2023.
6	A. Jain, N. Alnaasan, A. Shafi, H. Subramoni, and DK Panda, Optimizing Distributed DNN Training using CPUs and BlueField-2 DPUs, IEEE Micro, doi: 10.1109/MM.2021.3139027
7	DK Panda, H. Subramoni, C. Chu, and M. Bayatpour, The MVAPICH project: Transforming Research into High-Performance MPI Library for HPC Community, Journal of Computational Science (JOCS), Special Issue on Translational Computer Science, Oct 2020.
8	J. Hashmi, C. Chu, S. Chakraborty, M. Bayatpour, H. Subramoni, and DK Panda, FALCON-X: Zero-copy MPI Derived Datatype Processing on Modern CPU and GPU Architectures, Journal of Parallel and Distributed Computing (JPDC), Volume 144, October 2020, Pages 1-13, doi.org/10.1016/j.jpdc.2020.05.008
9	Ammar Awan, A. Jain, C. Chu, H. Subramoni, and DK Panda, Communication Profiling and Characterization of Deep Learning Workloads on Clusters with High-Performance Interconnects, IEEE Micro, vol. 40, no. 1, pp. 35-43, 1 Jan.-Feb. 2020.
10	A. Ruhela, H. Subramoni, S. Chakraborty, M. Bayatpour, P. Kousha, and DK Panda, Efficient Design for MPI Asynchronous Progress without Dedicated Resources, Parallel Computing - Systems & Applications, Volume 85, July 2019, Pages 13-26, https://doi.org/10.1016/j.parco.2019.03.003
11	Ammar Awan, K. Vadambacheri Manian, C. Chu, H. Subramoni, and DK Panda, Optimized Large-Message Broadcast for Deep Learning Workloads: MPI, MPI+NCCL, or NCCL2?, Parallel Computing, Volume 85, July 2019, Pages 141-152, https://doi.org/10.1016/j.parco.2019.03.005
12	C. Chu, X. Lu, Ammar Awan, H. Subramoni, Bracy Elton, and DK Panda, Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast, IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 30, no. 3, pp. 575-588, 1 March 2019.
13	S. Chakraborty, Ignacio Laguna, Murali Emani, Kathryn Mohror, DK Panda, Martin Schulz, and H. Subramoni, EReinit: Scalable and Efficient Fault Tolerance for Bulk-Synchronous MPI Applications, Concurrency and Computation: Practice and Experience, 14 August 2018, https://doi.org/10.1002/cpe.4863
14	S. Ramesh, A. Mahéo, S. Shende, A. Malony, H. Subramoni, A. Ruhela, and DK Panda, MPI performance engineering with the MPI tool interface: The integration of MVAPICH and TAU, ISSN 0167-8191, Volume 77, Sep 2018.
15	K. Hamidouche, A. Venkatesh, Ammar Awan, H. Subramoni, and DK Panda, CUDA-Aware OpenSHMEM: Extensions and Designs for High Performance OpenSHMEM on GPU Clusters, ParCo: Elsevier Parallel Computing Journal.
16	S. Sur, S. Potluri, K. Kandalla, H. Subramoni, K. Tomko, and DK Panda, Co-Designing MPI Library and Applications for InfiniBand Clusters, IEEE Computer, Nov 2011.
17	Srinivasan Ramesh, Aurele Maheo, Sameer Shende, Allen Malony, H. Subramoni, and DK Panda, MPI Performance Engineering with the MPI Tool Interface: the Integration of MVAPICH and TAU, Sep 2018.

Conferences & Workshops (187)
1	Design and Implementation of Casting Compression for GPU-Aware MPI Collectives C. Chen, N. Contini, L. Xu, J. Queiser, H. Subramoni, and DK Panda, 40th IEEE International Parallel & Distributed Processing Symposium, May 2026 [Bib - Plain]
2	From Skew to Symmetry: Node-Interconnect Multi-Path Balancing with Execution-time Planning for Modern GPU Clusters J. Yao, K. Suresh, B. Ramesh, H. Subramoni, and DK Panda, 40th IEEE International Parallel & Distributed Processing Symposium, May 2026 [Bib - Plain]
3	Design and Optimization of GPU-Aware MPI Allreduce Using Direct Sendrecv Communication C. Chen, J. Yao, H. Subramoni, and DK Panda, NVIDIA GTC AI Conference 2026, Mar 2026 [Research Poster] [Bib - Plain]
4	A Streaming Collectives Interface Targeting Dataflow Acceleration and HPC Workloads N. Contini, J. Queiser, B. Ramesh, H. Subramoni, and DK Panda, The International Conference for High Performance Computing, Networking, Storage, and Analysis 2025, Nov 2025 [Bib - Plain]
5	Design and Optimization of GPU-Aware MPI Allreduce Using Direct Sendrecv Communication C. Chen, J. Yao, H. Subramoni, and DK Panda, 54th International Conference on Parallel Processing, Sep 2025 [Bib - Plain]
6	Towards Dynamic Message Passing Protocols for Stencil-Based Communication Patterns K. Suresh, B. Ramesh, G. Kuncham, H. Subramoni, and DK Panda, IEEE International Conference on Cluster Computing 2025, Sep 2025 [Bharath and Kaushik are Co-Lead Authors] [Bib - Plain]
7	OMB-Compr: An Extension to OSU Micro Benchmarks for Collective Compression Error Measurement J. Queiser, N. Contini, H. Subramoni, and DK Panda, Practice and Experience in Advanced Research Computing 2025, Jul 2025 [Bib - Plain]
8	Use of BlueField-SmartNICs in Offloading One-Sided Communication Primitives B. Michalowicz, K. Suresh, H. Subramoni, DK Panda, and S. Poole, ISC HIGH PERFORMANCE 2025, Jun 2025 [Research Poster] [Bib - Plain]
9	Design and Implementation of MPI Collective Operations for Large Message Communication on AMD GPUs C. Chen, L. Xu, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2025, Jun 2025 [Research Poster] [Bib - Plain]
10	Design and Implementation of a GPU-Aware MPI Collective Library for Intel GPUs C. Chen, G. Kuncham, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2025, Jun 2025 [Research Poster] [Bib - Plain]
11	Unified Designs of Multi-rail-aware MPI Allreduce and Alltoall Operations Across Diverse GPU and Interconnect Systems C. Chen, J. Yao, L. Xu, H. Subramoni, and DK Panda, 39th IEEE International Parallel & Distributed Processing Symposium, Jun 2025 [Bib - Plain]
12	Training ultra long context language model with fully pipelined distributed transformer J. Yao, S. Jacobs, M. Tanaka, O. Ruwase, H. Subramoni, and DK Panda, The Eighth Annual Conference on Machine Learning and Systems, May 2025 [Bib - Plain]
13	Effective and Efficient Offloading Designs for One-Sided Communication to SmartNICs B. Michalowicz, K. Suresh, H. Subramoni, M. Abduljabbar, DK Panda, and S. Poole, 31st IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2024 [Bib - Plain]
14	Using BlueField-3 SmartNICs to Offload Vector Operations in Krylov Subspace Methods K. Suresh, B. Michalowicz, N. Contini, B. Ramesh, M. Abduljabbar, A. Shafi, H. Subramoni, and DK Panda, 31st IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2024 [Bib - Plain]
15	Design and Implementation of Kernel-based MPI Reduction Operations for Intel GPUs C. Chen, G. Kuncham, H. Subramoni, and DK Panda, 31st IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2024 [Bib - Plain]
16	Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning L. Xu, Q. Anthony, J. Hatef, A. Shafi, H. Subramoni, and DK Panda, 31st IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2024 [Bib - Plain]
17	HyperSack: Distributed Hyperparameter Optimization for Deep Learning using Resource-Aware Scheduling on Heterogeneous GPU Systems N. Alnaasan, B. Ramesh, J. Yao, A. Shafi, H. Subramoni, and DK Panda, 31st IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2024 [Bib - Plain]
18	HARVEST-2.0: High-Performance Vision Framework for End-to-end Preprocessing, Training, Inference, and Visualization N. Alnaasan, A. Potlapally, T. Chen, M. Lieber, A. Shafi, H. Subramoni, S. Shearer, and DK Panda, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'24), Nov 2024 [Research Poster] [Bib - Plain]
19	Demystifying the Communication Characteristics for Distributed Transformer Models Q. Anthony, B. Michalowicz, J. Hatef, L. Xu, M. Abduljabbar, A. Shafi, H. Subramoni, and DK Panda, IEEE Hot Interconnects Symposium 2024, Aug 2024 [Q. Anthony and B. Michalowicz are co-lead authors] [Bib - Plain]
20	Characterizing Communication in Distributed Parameter-Efficient Fine-Tuning for Large Language Models N. Alnaasan, H. Huang, A. Shafi, H. Subramoni, and DK Panda, IEEE Hot Interconnects Symposium 2024, Aug 2024 [Bib - Plain]
21	OHIO: Improving RDMA Network Scalability in MPI_Alltoall through Optimized Hierarchical and Intra/Inter-Node Communication Overlap Design T. Tran, G. Kuncham, B. Ramesh, S. Xu, H. Subramoni, M. Abduljabbar, and DK Panda, IEEE Hot Interconnects Symposium 2024, Aug 2024 [Bib - Plain]
22	The Case for Co-Designing Model Architectures with Hardware Q. Anthony, J. Hatef, Deepak Narayanan, Stella Biderman, Stas Bekman, Junqi Yin, A. Shafi, H. Subramoni, and DK Panda, 53rd International Conference on Parallel Processing, Aug 2024 [Bib - Plain]
23	Design and Implementation of an IPC-based Collective MPI Library for Intel GPUs C. Chen, G. Kuncham, P. Kousha, H. Subramoni, and DK Panda, Practice and Experience in Advanced Research Computing, Jul 2024 [Bib - Plain]
24	OMB-CXL: A Micro-Benchmark Suite for Evaluating MPI Communication Utilizing Compute Express Link Memory Devices T. Tran, M. Abduljabbar, Hooyoung Ahn, Seonyoung Kim, Yoomi Park, Woojong Han, Shinyoung Ahn, H. Subramoni, and DK Panda, Practice and Experience in Advanced Research Computing, Jul 2024 [July 21st to July 25th, 2024 in Providence, RI.] [Bib - Plain]
25	A Novel LLM-enabled Framework for Accelerating the Creation of Knowledge Graphs for HPC P. Kousha, V. Sathu, M. Han, J. Jani, N. Alnaasan, H. Subramoni, and DK Panda, Practice and Experience in Advanced Research Computing, Jul 2024 [Bib - Plain]
26	OMB-FPGA: A Microbenchmark Suite for FPGA-aware MPIs using OpenCL and SYCL N. Contini, M. Abduljabbar, H. Subramoni, and DK Panda, Practice and Experience in Advanced Research Computing, Jul 2024 [Bib - Plain]
27	Infer-HiRes: Accelerating Inference for High-Resolution Images with Quantization and Distributed Deep Learning R. Gulhane, Q. Anthony, H. Subramoni, and DK Panda, Practice and Experience in Advanced Research Computing, Jul 2024 [Bib - Plain]
28	PML-MPI: A Pre-Trained ML Framework for Efficient Collective Algorithm Selection in MPI M. Han, G. Kuncham, B. Michalowicz, R. Vaidya, M. Abduljabbar, A. Shafi, H. Subramoni, and DK Panda, iWAPT '24 (IPDPSW), May 2024 [Bib - Plain]
29	Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference J. Yao, Q. Anthony, A. Shafi, H. Subramoni, and DK Panda, 38th IEEE International Parallel & Distributed Processing Symposium, May 2024 [Bib - Plain]
30	HINT: Designing Cache-Efficient MPI_Alltoall using Hybrid Memory Copy Ordering and Non-Temporal Instructions B. Ramesh, N. Contini, N. Alnaasan, K. Suresh, M. Abduljabbar, A. Shafi, H. Subramoni, and DK Panda, 38th IEEE International Parallel & Distributed Processing Symposium, May 2024 [Bib - Plain]
31	Accelerating MPI AllReduce Communication with Efficient GPU-Based Compression Schemes on Modern GPU Clusters Q. Zhou, B. Ramesh, A. Shafi, M. Abduljabbar, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2024, May 2024 [Bib - Plain]
32	Profiling, Storing and Monitoring HPC Communication Data at Scale by OSU INAM P. Kousha, H. Subramoni, DK Panda, M. Tatineni, and P. Mulrooney, ISC HIGH PERFORMANCE 2024, May 2024 [Research Poster] [Bib - Plain]
33	High-Performance Semi-Supervised Learning with HARVEST: A Distributed Computer Vision Framework for Expert Labeling N. Alnaasan, M. Lieber, A. Shafi, H. Subramoni, S. Shearer, and DK Panda, ISC HIGH PERFORMANCE 2024, May 2024 [Research Poster] [Best Poster Award] [Bib - Plain]
34	Accelerating Large Language Model Training with Hybrid GPU-based Compression L. Xu, Q. Anthony, Q. Zhou, N. Alnaasan, R. Gulhane, A. Shafi, H. Subramoni, and DK Panda, IEEE/ACM International Symposium on Cluster, Cloud, and Internet Computing 2024, May 2024 [Bib - Plain]
35	AccDP: Accelerated Data-Parallel Distributed DNN Training for Modern GPU-Based HPC Clusters N. Alnaasan, A. Shafi, H. Subramoni, and DK Panda, NVIDIA GTC AI Conference 2024, Mar 2024 [Research Poster] [Bib - Plain]
36	Optimized All-to-all Connection Establishment for High-Performance MPI Libraries over InfiniBand S. Xu, G. Kuncham, M. Abduljabbar, H. Subramoni, and DK Panda, 30th IEEE INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING, DATA, & ANALYTICS, Dec 2023 [Bib - Plain]
37	Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference J. Yao, N. Alnaasan, T. Chen, A. Shafi, H. Subramoni, and DK Panda, 30th IEEE INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING, DATA, & ANALYTICS, Dec 2023 [Bib - Plain]
38	MPI Allgather Utilizing CXL Shared Memory Pool in Multi-Node Computing Systems Hooyoung Ahn, Seonyoung Kim, Yoomi Park, Woojong Han, Shinyoung Ahn, T. Tran, B. Ramesh, H. Subramoni, and DK Panda, IEEE International Conference on Big Data, Dec 2023 [Dec 15-18, 2024 @ Washington DC, USA] [Bib - Plain]
39	HARVEST: High-Performance Artificial Vision Framework for Expert Labeling using Semi-Supervised Training N. Alnaasan, M. Lieber, A. Shafi, H. Subramoni, S. Shearer, and DK Panda, 2023 IEEE International Conference on Big Data, Dec 2023 [Bib - Plain]
40	MPI4Spark Meets YARN: Enhancing MPI4Spark through YARN support for HPC K. Al Attar, A. Shafi, H. Subramoni, and DK Panda, 11th International Workshop on Distributed Storage and Blockchain Technologies for Big Data (IEEE Big Data '23), Dec 2023 [Bib - Plain]
41	Benchmarking Modern Databases for Storing and Profiling Very Large Scale HPC Communication Data P. Kousha, Q. Zhou, H. Subramoni, and DK Panda, The 15th BenchCouncil International Symposium On Benchmarking, Measuring And Optimizing, Dec 2023 [Bib - Plain]
42	MPI-xCCL: A Portable MPI Library over Collective Communication Libraries for Various Accelerators C. Chen, K. Khorassani, P. Kousha, Q. Zhou, J. Yao, H. Subramoni, and DK Panda, Sixth Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, Nov 2023 [Bib - Plain]
43	Democratizing HPC Access and Use with Knowledge Graphs P. Kousha, V. Sathu, M. Lieber, H. Subramoni, and DK Panda, D-HPC 2023: The First International Workshop on Democratizing High-Performance Computing, Nov 2023 [Bib - Plain]
44	Battle of the BlueFields: An In-Depth Comparison of the BlueField-2 and BlueField-3 SmartNICs B. Michalowicz, K. Suresh, H. Subramoni, DK Panda, and S. Poole, Hot Interconnects 2023, Aug 2023 [Bib - Plain]
45	DPU-Bench: A Micro-Benchmark Suite to Measure Offload Efficiency Of SmartNICs B. Michalowicz, K. Suresh, H. Subramoni, DK Panda, and S. Poole, Practice and Experience in Advanced Research Computing 23, Jul 2023 [Bib - Plain]
46	Optimizing Amber for Device-to-Device GPU Communication S. Khuvis, K. Tomko, S. Brozell, C. Chen, H. Subramoni, and DK Panda, Practice and Experience in Advanced Research Computing 23, Jul 2023 [Bib - Plain]
47	Enabling Reconfigurable HPC through MPI-based Inter-FPGA Communication N. Contini, B. Ramesh, K. Suresh, T. Tran, B. Michalowicz, M. Abduljabbar, H. Subramoni, and DK Panda, International Conference on Supercomputing 2023, Jun 2023 [Bib - Plain]
48	SAI: AI-Enabled Speech Assistant Interface for Science Gateways in HPC P. Kousha, A. Jain, A. Kolli, M. Lieber, M. Han, N. Contini, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2023, May 2023 [Bib - Plain]
49	Accelerating Distributed Deep Learning Training with Compression Assisted Allgather and Reduce-Scatter Communication Q. Zhou, Q. Anthony, L. Xu, A. Shafi, M. Abduljabbar, H. Subramoni, and DK Panda, 37th IEEE International Parallel & Distributed Processing Symposium (IPDPS '23), May 2023 [Bib - Plain]
50	MCR-DL: Mix-and-Match Communication Runtime for Deep Learning Q. Anthony, Ammar Awan, J. Rasley, Y. He, A. Shafi, M. Abduljabbar, H. Subramoni, and DK Panda, 37th IEEE International Parallel & Distributed Processing Symposium (IPDPS '23), May 2023 [Bib - Plain]
51	A Novel Framework for Efficient Offloading of Communication Operations to Bluefield SmartNICs K. Suresh, B. Michalowicz, B. Ramesh, N. Contini, J. Yao, S. Xu, A. Shafi, H. Subramoni, and DK Panda, 37th IEEE International Parallel & Distributed Processing Symposium (IPDPS '23), May 2023 [Bib - Plain]
52	Designing and Optimizing GPU-aware Nonblocking MPI Neighborhood Collective Communication for PETSc K. Khorassani, C. Chen, H. Subramoni, and DK Panda, 37th IEEE International Parallel & Distributed Processing Symposium (IPDPS '23), May 2023 [Bib - Plain]
53	In-Depth Evaluation of a Lower-Level Direct-Verbs API on InfiniBand-based Clusters: Early Experiences B. Michalowicz, K. Suresh, B. Ramesh, A. Shafi, H. Subramoni, M. Abduljabbar, and DK Panda, 25th Workshop on Advances in Parallel and Distributed Computational Models, May 2023 [Held in conjunction with IPDPS 2023] [Bib - Plain]
54	Implementing and Optimizing a GPU-aware MPI Library for Intel GPUs: Early Experiences C. Chen, K. Khorassani, G. Kuncham, R. Vaidya, M. Abduljabbar, A. Shafi, H. Subramoni, and DK Panda, THE 23RD IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, May 2023 [Bib - Plain]
55	ScaMP: Scalable Meta-Parallelism for Deep Learning Search Q. Anthony, L. Xu, A. Shafi, H. Subramoni, and DK Panda, THE 23RD IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, May 2023 [Bib - Plain]
56	Performance Characterization of using Quantization for DNN Inference on Edge Devices H. Ahn, T. Chen, N. Alnaasan, A. Shafi, M. Abduljabbar, H. Subramoni, and DK Panda, 7TH IEEE INTERNATIONAL CONFERENCE ON FOG AND EDGE COMPUTING, May 2023 [Bib - Plain]
57	AccDP: Accelerated Data-Parallel Distributed DNN Training for Modern GPU-Based HPC Clusters N. Alnaasan, A. Jain, A. Shafi, H. Subramoni, and DK Panda, 29th IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2022 [Bib - Plain]
58	Designing Efficient Pipelined Communication Schemes using Compression in MPI Libraries B. Ramesh, Q. Zhou, A. Shafi, M. Abduljabbar, H. Subramoni, and DK Panda, 29th IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2022 [Bib - Plain]
59	Accelerating Broadcast Communication with GPU Compression for Deep Learning Workloads Q. Zhou, Q. Anthony, A. Shafi, H. Subramoni, and DK Panda, 29th IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2022 [Bib - Plain]
60	Efficient Personalized and Non-Personalized Alltoall Communication for Modern Multi-HCA GPU-Based Clusters K. Suresh, A. Paniraja Guptha, B. Michalowicz, B. Ramesh, M. Abduljabbar, A. Shafi, H. Subramoni, and DK Panda, 29th IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2022 [Bib - Plain]
61	Spark Meets MPI: Towards High-Performance Communication Framework for Spark using MPI K. Al Attar, A. Shafi, M. Abduljabbar, H. Subramoni, and DK Panda, IEEE Cluster '22, Sep 2022 [Bib - Plain]
62	Designing Hierarchical Multi-HCA Aware Allgather in MPI T. Tran, B. Michalowicz, B. Ramesh, H. Subramoni, A. Shafi, and DK Panda, Fifteenth International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2), 2022, Aug 2022 [Held in conjunction with ICPP 2022: The 51st International Conference on Parallel Processing, August 29th to Sept 1st, 2022 in Bordeaux, France] [Bib - Plain]
63	High Performance MPI over the Slingshot Interconnect: Early Experiences K. Khorassani, C. Chen, B. Ramesh, A. Shafi, H. Subramoni, and DK Panda, Practice and Experience in Advanced Research Computing, Jul 2022 [Best Student Paper Award] [Bib - Plain]
64	Arm meets Cloud: A Case Study of MPI Library Performance on AWS Arm-based HPC Cloud with Elastic Fabric Adapter S. Xu, A. Shafi, H. Subramoni, and DK Panda, 24th Workshop on Advances in Parallel and Distributed Computational Models, May 2022 [Bib - Plain]
65	Towards Java-based HPC using the MVAPICH2 Library: Early Experiences K. Al Attar, A. Shafi, H. Subramoni, and DK Panda, HIPS '22 (IPDPSW), May 2022 [Bib - Plain]
66	Highly Efficient Alltoall and Alltoallv Communication Algorithms for GPU Systems C. Chen, K. Khorassani, Q. Anthony, A. Shafi, H. Subramoni, and DK Panda, Heterogeneity in Computing Workshop (HCW 2022), May 2022 [held in conjunction with IPDPS'22] [Bib - Plain]
67	OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries on HPC Systems N. Alnaasan, A. Jain, A. Shafi, H. Subramoni, and DK Panda, 23rd Parallel and Distributed Scientific and Engineering Computing Workshop (PDSEC) at IPDPS22, May 2022 [Bib - Plain]
68	Hy-Fi: Hybrid Five-Dimensional Parallel DNN Training on High-Performance GPU Clusters A. Jain, A. Shafi, Q. Anthony, P. Kousha, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2022, May 2022 [Bib - Plain]
69	Accelerating MPI All-to-All Communication with Online Compression on Modern GPU Clusters Q. Zhou, P. Kousha, Q. Anthony, K. Khorassani, A. Shafi, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2022, May 2022 [Bib - Plain]
70	"Hey CAI" - Enhancing User Productivity through a Conversational AI Enabled User Interface for HPC Tools P. Kousha, A. Jain, A. Kolli, S. Prasanna, S. Miriyala, H. Subramoni, A. Shafi, and DK Panda, ISC HIGH PERFORMANCE 2022, May 2022 [Bib - Plain]
71	OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries and Machine Learning Applications on HPC Systems N. Alnaasan, A. Jain, A. Shafi, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2022, May 2022 [Research Poster] [Best Poster Award] [Bib - Plain]
72	DistMILE: A Distributed Multi-Level Framework for Scalable Graph Embedding Yuntian He, Saket Gurukar, P. Kousha, H. Subramoni, Dhabaleswar K. Panda, and Srinivasan Parthasarathy, 28th IEEE International Conference on High Performance Computing, Data, Analytics, and Data Science, Dec 2021 [Bib - Plain]
73	Towards Architecture-aware Hierarchical Communication Trees on Modern HPC Systems B. Ramesh, J. Hashmi, S. Xu, A. Shafi, M. Ghazimirsaeed, M. Bayatpour, H. Subramoni, and DK Panda, 28th IEEE International Conference on High Performance Computing, Data, Analytics, and Data Science, Dec 2021 [Best Paper Finalist] [Bib - Plain]
74	Layout aware Hardware assisted Designs for Derived Data Types in MPI K. Suresh, B. Ramesh, C. Chen, M. Ghazimirsaeed, M. Bayatpour, A. Shafi, H. Subramoni, and DK Panda, 28th IEEE International Conference on High Performance Computing, Data, Analytics, and Data Science, Dec 2021 [Bib - Plain]
75	Large-Message Nonblocking MPI_Iallgather and MPI_Ibcast Offload via BlueField-2 DPU N. Sarkauskas, M. Bayatpour, T. Tran, B. Ramesh, H. Subramoni, and DK Panda, 28th IEEE International Conference on High Performance Computing, Data, Analytics, and Data Science, Dec 2021 [Short Paper] [Bib - Plain]
76	Accelerating CPU-based Distributed DNN Training on Modern HPC Clusters using BlueField-2 DPUs A. Jain, N. Alnaasan, A. Shafi, H. Subramoni, and DK Panda, 28th IEEE Hot Interconnects, Aug 2021 [Bib - Plain]
77	INAM: Cross-stack Profiling and Analysis of Communication in MPI-based Applications P. Kousha, K. Raj, M. Kedia, H. Subramoni, A. Jain, A. Shafi, DK Panda, Trey Dockendorf, Heechang Na, and K. Tomko, Practice and Experience in Advanced Research Computing 2021, Jul 2021 [Bib - Plain]
78	BluesMPI: Efficient MPI Non-blocking Alltoall Offloading Designs on Modern BlueField Smart NICs M. Bayatpour, N. Sarkauskas, H. Subramoni, J. Hashmi, and DK Panda, ISC HIGH PERFORMANCE 2021, Jun 2021 [Bib - Plain]
79	Designing a ROCm-aware MPI Library for AMD GPUs: Early Experiences K. Khorassani, J. Hashmi, C. Chu, C. Chen, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2021, Jun 2021 [Bib - Plain]
80	Scaling Single-Image Super-Resolution Training on Modern HPC Clusters: Early Experiences Q. Anthony, L. Xu, H. Subramoni, and DK Panda, Scalable Deep Learning over Parallel And Distributed Infrastructures, May 2021 [Bib - Plain]
81	SUPER: SUb-Graph Parallelism for TransformERs A. Jain, T. Moon, T. Benson, H. Subramoni, S. Jacobs, DK Panda, and B. Essen, 35th IEEE International Parallel & Distributed Processing Symposium, May 2021 [Bib - Plain]
82	Designing High-Performance MPI Libraries with On-the-fly Compression for Modern GPU Clusters Q. Zhou, C. Chu, N. Senthil Kumar, P. Kousha, M. Ghazimirsaeed, H. Subramoni, and DK Panda, 35th IEEE International Parallel & Distributed Processing Symposium, May 2021 [Best Paper Finalist] [Bib - Plain]
83	Adaptive and Hierarchical Large Message All-to-all Communication Algorithms for Large-scale Dense GPU Systems K. Khorassani, C. Chu, Q. Anthony, H. Subramoni, and DK Panda, The 21st IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, May 2021 [Bib - Plain]
84	Efficient MPI-based Communication for GPU-Accelerated Dask Applications A. Shafi, J. Hashmi, H. Subramoni, and DK Panda, The 21st IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, May 2021 [Bib - Plain]
85	Blink: Towards Efficient RDMA-based Communication Coroutines for Parallel Python Applications A. Shafi, J. Hashmi, H. Subramoni, and DK Panda, 27TH IEEE INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING, DATA, AND ANALYTICS, Dec 2020 [Bib - Plain]
86	A Hierarchical and Load-Aware Design for Large Message Neighborhood Collectives M. Ghazimirsaeed, Q. Zhou, A. Ruhela, M. Bayatpour, H. Subramoni, and DK Panda, SC 2020, Nov 2020 [Bib - Plain]
87	GEMS: GPU Enabled Memory Aware Model Parallelism System for Distributed DNN Training A. Jain, Ammar Awan, A. Aljuhani, J. Hashmi, Q. Anthony, H. Subramoni, DK Panda, R. Machiraju, and A. Parwani, SC 2020, Nov 2020 [Bib - Plain]
88	Scalable MPI Collectives using SHARP: Large Scale Performance Evaluation on the TACC Frontera System B. Ramesh, K. Suresh, N. Sarkauskas, M. Bayatpour, J. Hashmi, H. Subramoni, and DK Panda, ExaMPI2020 - Workshop on Exascale MPI 2020, Nov 2020 [Bib - Plain]
89	MPI Meets Cloud: Case Study with Amazon EC2 and Microsoft Azure S. Xu, M. Ghazimirsaeed, J. Hashmi, H. Subramoni, and DK Panda, 4th Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, Nov 2020 [Bib - Plain]
90	Design and Characterization of Infiniband Hardware Tag Matching in MPI M. Bayatpour, M. Ghazimirsaeed, S. Xu, H. Subramoni, and DK Panda, The 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, Nov 2020 [Bib - Plain]
91	Accelerating GPU-based Machine Learning in Python using MPI Library: A Case Study with MVAPICH2-GDR M. Ghazimirsaeed, Q. Anthony, A. Shafi, H. Subramoni, and DK Panda, 6th Workshop on Machine Learning in HPC Environments, Nov 2020 [Bib - Plain]
92	Dynamic Kernel Fusion for Bulk Non-contiguous Data Transfer on GPU Clusters C. Chu, K. Khorassani, Q. Zhou, H. Subramoni, and DK Panda, 22nd IEEE International Conference on Cluster Computing (IEEE Cluster 2020), Sep 2020 [Bib - Plain]
93	Accelerated Real-time Network Monitoring and Profiling at Scale using OSU INAM P. Kousha, K. Raj, H. Subramoni, DK Panda, H. Na, T. Dockendorf, and K. Tomko, Practice and Experience in Advanced Research Computing 2020, Jul 2020 [Bib - Plain]
94	NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems C. Chu, P. Kousha, Ammar Awan, K. Khorassani, H. Subramoni, and DK Panda, The 34th ACM International Conference on Supercomputing (ICS-2020), Jun 2020 [Bib - Plain]
95	Communication-Aware Hardware-Assisted MPI Overlap Engine M. Bayatpour, J. Hashmi, S. Chakraborty, K. Suresh, M. Ghazimirsaeed, B. Ramesh, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2020, Jun 2020 [Bib - Plain]
96	HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training with TensorFlow Ammar Awan, A. Jain, Q. Anthony, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2020, Jun 2020 [Bib - Plain]
97	OSU INAM: Profiling and Visualization Tool for Scalable and In-Depth Analysis of High-Performance GPU-enabled HPC Clusters P. Kousha, K. Raj, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2020, Jun 2020 [Best Poster Award] [Bib - Plain]
98	Efficient Training of Semantic Image Segmentation on Summit using Horovod and MVAPICH2-GDR Q. Anthony, Ammar Awan, A. Jain, H. Subramoni, and DK Panda, Scalable Deep Learning over Parallel and Distributed Infrastructures (ScaDL) at IPDPS '20, May 2020 [Bib - Plain]
99	Machine-agnostic and Communication-aware Designs for MPI on Emerging Architectures J. Hashmi, S. Xu, B. Ramesh, M. Bayatpour, H. Subramoni, and DK Panda, 34th IEEE International Parallel & Distributed Processing Symposium (IPDPS '20), May 2020 [Bib - Plain]
100	Performance Characterization of Network Mechanisms for Non-Contiguous Data Transfers in MPI K. Suresh, B. Ramesh, M. Ghazimirsaeed, M. Bayatpour, J. Hashmi, H. Subramoni, and DK Panda, Workshop on Scalable Networks for Advanced Computing Systems (SNACS) at IPDPS '20, May 2020 [Bib - Plain]
101	Analyzing and Understanding the Impact of Interconnect Performance on HPC, Big Data, and Deep Learning Applications: A Case Study with InfiniBand EDR and HDR A. Ruhela, S. Xu, K. Vadambacheri Manian, H. Subramoni, and DK Panda, Workshop on Scalable Networks for Advanced Computing Systems (SNACS) at IPDPS '20, May 2020 [Bib - Plain]
102	High-Performance Adaptive MPI Derived Datatype Communication for Modern Multi-GPU Systems C. Chu, J. Hashmi, K. Khorassani, H. Subramoni, and DK Panda, 26th IEEE International Conference on High Performance Computing, Data, Analytics and Data Science (HiPC '19), Dec 2019 [Bib - Plain]
103	Designing a Profiling and Visualization Tool for Scalable and In-Depth Analysis of High-Performance GPU Clusters P. Kousha, B. Ramesh, K. Suresh, C. Chu, A. Jain, N. Sarkauskas, H. Subramoni, and DK Panda, 26th IEEE International Conference on High Performance Computing, Data, Analytics and Data Science (HiPC '19), Dec 2019 [Bib - Plain]
104	Design and Evaluation of Shared Memory Communication Benchmarks on Emerging Architectures using MVAPICH2 S. Xu, J. Hashmi, S. Chakraborty, H. Subramoni, and DK Panda, Third Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, Nov 2019 [Bib - Plain]
105	Leveraging Network-level parallelism with Multiple Process-Endpoints for MPI Broadcast A. Ruhela, B. Ramesh, S. Chakraborty, H. Subramoni, J. Hashmi, and DK Panda, Third Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, Nov 2019 [Bib - Plain]
106	OMB-UM: Design, Implementation, and Evaluation of CUDA Unified Memory Aware MPI Benchmarks K. Vadambacheri Manian, C. Chu, Ammar Awan, K. Khorassani, H. Subramoni, and DK Panda, 10th International Workshop in Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, Nov 2019 [Bib - Plain]
107	Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera A. Jain, Ammar Awan, H. Subramoni, and DK Panda, 3rd Deep Learning on Supercomputers Workshop (DLS) at SC19, Nov 2019 [Bib - Plain]
108	Performance Characterization of DNN Training using TensorFlow and PyTorch on Modern Clusters A. Jain, Ammar Awan, Q. Anthony, H. Subramoni, and DK Panda, 21st IEEE International Conference on Cluster Computing, Sep 2019 [Bib - Plain]
109	Communication Profiling and Characterization of Deep-Learning Workloads on Clusters With High-Performance Interconnects Ammar Awan, A. Jain, C. Chu, H. Subramoni, and DK Panda, 26th Symposium on High-Performance Interconnects (HotI '19), Aug 2019 [Bib - Plain]
110	Designing Scalable and High-performance MPI Libraries on Amazon Elastic Fabric Adapter S. Chakraborty, S. Xu, H. Subramoni, and DK Panda, HOT Interconnects 26, Aug 2019 [Bib - Plain]
111	Performance Evaluation of MPI Libraries on GPU-enabled OpenPOWER Architectures: Early Experiences K. Khorassani, C. Chu, H. Subramoni, and DK Panda, International Workshop on OpenPOWER for HPC, held in conjunction with ISC'19, Jun 2019 [Bib - Plain]
112	Reduction Operations on Modern Supercomputers: Challenges and Solutions M. Bayatpour, J. Hashmi, S. Chakraborty, H. Subramoni, and DK Panda, ISC High Performance 2019, Jun 2019 [Best Poster Award] [Bib - Plain]
113	FALCON: Efficient Designs for Zero-copy MPI Datatype Processing on Emerging Architectures J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and DK Panda, 33rd IEEE International Parallel & Distributed Processing Symposium (IPDPS '19), May 2019 [Best Paper Finalist] [Bib - Plain]
114	Design and Characterization of Shared Address Space MPI Collectives on Modern Architectures J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and DK Panda, The 19th Annual IEEE/ACM International Symposium in Cluster, Cloud, and Grid Computing (CCGRID 2019), May 2019 [Bib - Plain]
115	Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation Ammar Awan, J. Bedorf, C. Chu, H. Subramoni, and DK Panda, The 19th Annual IEEE/ACM International Symposium in Cluster, Cloud, and Grid Computing (CCGRID 2019), May 2019 [Bib - Plain]
116	OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training Ammar Awan, C. Chu, H. Subramoni, X. Lu, and DK Panda, 25th IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2018 [Bib - Plain]
117	Cooperative Rendezvous Protocols for Improved Performance and Overlap S. Chakraborty, M. Bayatpour, J. Hashmi, H. Subramoni, and DK Panda, 2018 The International Conference for High Performance Computing, Networking, Storage, and Analysis, Nov 2018 [Best Student Paper Finalist] [Bib - Plain]
118	Efficient Asynchronous Communication Progress for MPI without Dedicated Resources A. Ruhela, H. Subramoni, S. Chakraborty, M. Bayatpour, P. Kousha, and DK Panda, The EuroMPI 2018 Conference, Sep 2018 [Bib - Plain]
119	Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL? Ammar Awan, C. Chu, H. Subramoni, and DK Panda, The EuroMPI 2018 Conference, Sep 2018 [Bib - Plain]
120	Multi-Threading and Lock-Free MPI RMA Based Graph Processing on KNL and POWER Architectures M. Li, X. Lu, H. Subramoni, and DK Panda, The EuroMPI 2018 Conference, Sep 2018 [Bib - Plain]
121	SALaR: Scalable and Adaptive Designs for Large Message Reduction Collectives M. Bayatpour, J. Hashmi, S. Chakraborty, H. Subramoni, P. Kousha, and DK Panda, IEEE Cluster 2018, Sep 2018 [Best Paper Award] [Bib - Plain]
122	Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and DK Panda, 32nd IEEE International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018 [Bib - Plain]
123	Kernel-assisted Communication Engine for MPI on Emerging Manycore Processors J. Hashmi, K. Hamidouche, H. Subramoni, and DK Panda, 24th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC'17), Dec 2017 [Bib - Plain]
124	Designing Registration Caching Free High-Performance MPI Library with Implicit On-Demand Paging (ODP) of InfiniBand M. Li, X. Lu, H. Subramoni, and DK Panda, 24th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC'17), Dec 2017 [Bib - Plain]
125	An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures Ammar Awan, H. Subramoni, and DK Panda, 3rd Workshop on Machine Learning in High Performance Computing Environments, held in conjunction with SC17, Nov 2017 [Bib - Plain]
126	Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and DK Panda, SuperComputing 2017, Nov 2017 [Bib - Plain]
127	Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X J. Hashmi, M. Li, H. Subramoni, and DK Panda, Intel Xeon Phi User's Group (IXPUG) 2017, Sep 2017 [Bib - Plain]
128	Advancing MPI Libraries to the Many-core Era: Designs and Evaluations with MVAPICH2 S. Chakraborty, M. Bayatpour, H. Subramoni, and DK Panda, Intel Xeon Phi User's Group (IXPUG) 2017, Sep 2017 [Bib - Plain]
129	Contention Aware Kernel-Assisted MPI Collectives for Multi/Many-core Systems S. Chakraborty, H. Subramoni, and DK Panda, 2017 IEEE International Conference on Cluster Computing, Sep 2017 [Best Paper Finalist] [Bib - Plain]
130	Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning C. Chu, X. Lu, Ammar Awan, H. Subramoni, J. Hashmi, Bracy Elton, and DK Panda, ICPP 2017: International Conference on Parallel Processing, Aug 2017 [Bib - Plain]
131	Exploiting and Evaluating OpenSHMEM on KNL Architecture J. Hashmi, M. Li, H. Subramoni, and DK Panda, Fourth Workshop on OpenSHMEM and Related Technologies, Aug 2017 [Bib - Plain]
132	Designing Dynamic and Adaptive MPI Point-to-point Communication Protocols for Efficient Overlap of Computation and Communication H. Subramoni, S. Chakraborty, and DK Panda, International Supercomputing Conference (ISC ’17), Jun 2017 [Hans Meuer Award (Most Outstanding Research Paper)] [Bib - Plain]
133	Impact of HPC Cloud Networking Technologies on Accelerating Hadoop RPC and HBase X. Lu, D. Shankar, S. Gugnani, H. Subramoni, and DK Panda, 8th IEEE International Conference on Cloud Computing Technology and Science (IEEE CloudCom '16), Dec 2016 [Bib - Plain]
134	Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications C. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and DK Panda, First Workshop on Optimization of Communication in HPC runtime systems (COMHPC, SC Workshop), Nov 2016 [Bib - Plain]
135	Designing MPI Library with On-Demand Paging (ODP) of InfiniBand: Challenges and Benefits M. Li, K. Hamidouche, X. Lu, H. Subramoni, J. Zhang, and DK Panda, SuperComputing 2016, Nov 2016 [Bib - Plain]
136	Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters C. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and DK Panda, 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'16), Oct 2016 [Bib - Plain]
137	Adaptive and Dynamic Design for MPI Tag Matching M. Bayatpour, H. Subramoni, S. Chakraborty, and DK Panda, IEEE Cluster 2016, Sep 2016 [Best Paper Nominee] [Bib - Plain]
138	INAM^2: InfiniBand Network Analysis & Monitoring with MPI H. Subramoni, A. Augustine, M. Arnold, J. Perkins, X. Lu, K. Hamidouche, and DK Panda, International Supercomputing Conference, Jun 2016 [Slides] [Bib - Plain]
139	Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled System C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and DK Panda, The 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS '16), May 2016 [Bib - Plain]
140	SHMEMPMI - Shared Memory based PMI for Improved Performance and Scalability S. Chakraborty, H. Subramoni, J. Perkins, and DK Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid'16), May 2016 [Bib - Plain]
141	Offloaded GPU Collectives using CORE-Direct and CUDA Capabilities on IB Clusters A. Venkatesh, K. Hamidouche, H. Subramoni, and DK Panda, 22nd IEEE International Conference on High Performance Computing, Dec 2015 [Bib - Plain]
142	GPU-Aware Design, Implementation, and Evaluation of Non-blocking Collective Benchmarks Ammar Awan, K. Hamidouche, A. Venkatesh, J. Perkins, H. Subramoni, and DK Panda, EuroMPI 2015, Sep 2015 [Bib - Plain]
143	High Performance MPI Datatype Support with User-mode Memory Registration: Challenges, Designs and Benefits M. Li, H. Subramoni, K. Hamidouche, X. Lu, and DK Panda, IEEE Cluster 2015, Sep 2015 [Bib - Plain]
144	Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters K. Hamidouche, A. Venkatesh, Ammar Awan, H. Subramoni, and DK Panda, IEEE Cluster 2015, Sep 2015 [Bib - Plain]
145	Impact of InfiniBand DC Transport Protocol on Energy Consumption of All-to-all Collective Algorithms H. Subramoni, A. Venkatesh, K. Hamidouche, K. Tomko, and DK Panda, 23rd International Symposium on High Performance Interconnects 2015, Aug 2015 [Bib - Plain]
146	Designing Non-Blocking Personalized Collectives with Near Perfect Overlap for RDMA-Enabled Clusters H. Subramoni, Ammar Awan, K. Hamidouche, D. Pekurovsky, A. Venkatesh, S. Chakraborty, K. Tomko, and DK Panda, ISC '15, Jul 2015 [Bib - Plain]
147	On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI S. Chakraborty, H. Subramoni, J. Perkins, Ammar Awan, and DK Panda, HIPS '15 (IPDPS Workshop), May 2015 [Bib - Plain]
148	Non-blocking PMI Extensions for Fast MPI Startup S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins, and DK Panda, CCGrid '15, May 2015 [Bib - Plain]
149	A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters A. Venkatesh, H. Subramoni, K. Hamidouche, and DK Panda, IEEE International Conference on High Performance Computing (HiPC ’14), Dec 2014 [Bib - Plain]
150	Designing Scalable Out-of-core Sorting with Hybrid MPI+PGAS Programming Models J. Jose, S. Potluri, H. Subramoni, X. Lu, K. Hamidouche, K. Schulz, H. Sundar, and DK Panda, International Conference on Partitioned Global Address Space Programming Models (PGAS '14), Oct 2014 [Bib - Plain]
151	PMI Extensions for Scalable MPI Startup S. Chakraborty, H. Subramoni, J. Perkins, A. Moody, M. Arnold, and DK Panda, EuroMPI/ASIA 2014, Sep 2014 [Bib - Plain]
152	Designing Topology-Aware Communication Schedules for Alltoall Operations in Large InfiniBand Clusters H. Subramoni, K. Kandalla, J. Jose, K. Tomko, K. Schulz, D. Pekurovsky, and DK Panda, International Conference on Parallel Processing (ICPP’14), Sep 2014 [Bib - Plain]
153	Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty, and DK Panda, IEEE International Supercomputing Conference (ISC ’14), Jun 2014 [Bib - Plain]
154	MVAPICH-PRISM: A Proxy-based Communication Framework using InfiniBand and SCIF for Intel MIC Clusters S. Potluri, D. Bureddy, K. Hamidouche, A. Venkatesh, K. Kandalla, H. Subramoni, and DK Panda, International Conference on Supercomputing (SC 2013), Nov 2013 [Bib - Plain]
155	High-Performance Design of Hadoop RPC with RDMA over InfiniBand X. Lu, N. Islam, M. W. Rahman, J. Jose, H. Subramoni, H. Wang, and DK Panda, International Conference on Parallel Processing (ICPP '13), Oct 2013 [Bib - Plain]
156	A Novel Functional Partitioning Approach to Design High-Performance MPI-3 Non-Blocking Alltoallv Collective on Multi-core Systems K. Kandalla, H. Subramoni, K. Tomko, D. Pekurovsky, and DK Panda, International Conference on Parallel Processing (ICPP '13), Oct 2013 [Bib - Plain]
157	Design of Network Topology Aware Scheduling Services for Large InfiniBand Clusters H. Subramoni, D. Bureddy, K. Kandalla, K. Schulz, B. Barth, J. Perkins, M. Arnold, and DK Panda, IEEE Cluster (Cluster '13), Sep 2013 [Bib - Plain]
158	MIC-RO: Enabling Efficient Remote Offload on Heterogeneous Many Integrated Core (MIC) Clusters with InfiniBand K. Hamidouche, S. Potluri, H. Subramoni, K. Kandalla, and DK Panda, International Conference on Supercomputing (ICS '13), Jun 2013 [Bib - Plain]
159	High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand M. W. Rahman, N. Islam, X. Lu, J. Jose, H. Subramoni, H. Wang, and DK Panda, International Workshop on High Performance Data Intensive Computing (HPDIC), May 2013 [Bib - Plain]
160	Extending OpenSHMEM for GPU Computing S. Potluri, D. Bureddy, H. Wang, H. Subramoni, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS '13), May 2013 [Slides] [Bib - Plain]
161	High Performance RDMA-Based Design of HDFS over InfiniBand N. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy, and DK Panda, International Conference on Supercomputing (SC '12), Nov 2012 [Slides] [Bib - Plain]
162	Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and DK Panda, International Conference on Supercomputing (SC '12), Nov 2012 [Bib - Plain]
163	Minimizing Network Contention in InfiniBand Clusters with a QoS-Aware Data-Staging Framework R. Rajachandrasekar, J. Jaswani, H. Subramoni, and DK Panda, IEEE Cluster (Cluster '12), Sep 2012 [Bib - Plain]
164	Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms? K. Kandalla, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and DK Panda, Int'l Workshop on Parallel Algorithm and Parallel Software (IWPAPS12), held in conjunction with IEEE Cluster (Cluster '12), Sep 2012 [Bib - Plain]
165	A Scalable InfiniBand Network-Topology-Aware Performance Analysis Tool for MPI H. Subramoni, J. Vienne, and DK Panda, International Workshop on Productivity and Performance (Proper '12), Aug 2012 [Bib - Plain]
166	Performance Analysis and Evaluation of InfiniBand FDR and 40GigE RoCE on HPC and Cloud Computing System J. Vienne, J. Chen, M. W. Rahman, N. Islam, H. Subramoni, and DK Panda, International Symposium on High-Performance Interconnects (HotI 2012), Aug 2012 [Bib - Plain]
167	High-Performance Design of HBase with RDMA over InfiniBand J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, C. Murthy, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS '12), May 2012 [Bib - Plain]
168	Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers K. Kandalla, U. Yang, J. Keasler, T. Kolev, A. Moody, H. Subramoni, K. Tomko, J. Vienne, and DK Panda, International Parallel and Distributed Processing Symposium 2012, May 2012 [Bib - Plain]
169	Designing Network Failover and Recovery in MPI for Multi-Rail InfiniBand Clusters S. P. Raikar, H. Subramoni, K. Kandalla, J. Vienne, and DK Panda, International Workshop on System Management Techniques, May 2012 [Bib - Plain]
170	Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks? M. W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, C. Murthy, and DK Panda, International Symposium on Performance Analysis of Systems and Software (ISPASS '12), Poster Paper, Apr 2012 [Bib - Plain]
171	Memcached Design on High Performance RDMA Capable Interconnects J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, M. W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur, and DK Panda, International Conference on Parallel Processing (ICPP '11), Sep 2011 [Slides] [Bib - Plain]
172	Design and Evaluation of Network Topology-/Speed- Aware Broadcast Algorithms for InfiniBand Clusters H. Subramoni, K. Kandalla, J. Vienne, S. Sur, B. Barth, K. Tomko, R. McLay, K. Schulz, and DK Panda, IEEE Cluster '11, Sep 2011 [Bib - Plain]
173	INAM - A Scalable InfiniBand Network Analysis and Monitoring Tool N. Dandapanthula, H. Subramoni, J. Vienne, K. Kandalla, S. Sur, DK Panda, and R. Brightwell, 4th International Workshop on Productivity and Performance (PROPER 2011), Aug 2011 [Slides] [Bib - Plain]
174	Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL K. Kandalla, H. Subramoni, J. Vienne, K. Tomko, S. Sur, and DK Panda, Hot Interconnect '11, Aug 2011 [Bib - Plain]
175	High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT K. Kandalla, H. Subramoni, K. Tomko, D. Pekurovsky, S. Sur, and DK Panda, International Supercomputing Conference '11 (ISC'11), Jun 2011 [Bib - Plain]
176	Scalable Memcached design for InfiniBand Clusters using Hybrid Transports J. Jose, H. Subramoni, K. Kandalla, M. W. Rahman, H. Wang, S. Narravula, and DK Panda, International Symposium on Cluster, May 2011 [Bib - Plain]
177	Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters H. Subramoni, P. Lai, S. Sur, and DK Panda, International Conference on Parallel Processing (ICPP '10), Sep 2010 [Slides] [Bib - Plain]
178	High Performance Design and Implementation of Nemesis Communication Layer for Two-sided and One-Sided MPI Semantics in MVAPICH2 M. Luo, S. Potluri, P. Lai, E. Mancini, H. Subramoni, K. Kandalla, S. Sur, and DK Panda, International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2 '10), Sep 2010 [Bib - Plain]
179	Design and Evaluation of Generalized Collective Communication Primitives with Overlap using ConnectX-2 Offload Engine H. Subramoni, K. Kandalla, S. Sur, and DK Panda, International Symposium on High Performance Interconnects 2010, Aug 2010 [Bib - Plain]
180	High Performance Data Transfer in Grid Environment Using GridFTP over InfiniBand H. Subramoni, P. Lai, R. Kettimuthu, and DK Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid'10), May 2010 [Slides] [Bib - Plain]
181	Designing Topology-Aware Collective Communication Algorithms for Large Scale InfiniBand Clusters: Case Studies with Scatter and Gather K. Kandalla, H. Subramoni, A. Vishnu, and DK Panda, International Workshop on Communication Architecture for Clusters (CAC 10), Apr 2010 [Bib - Plain]
182	Designing Efficient FTP Mechanisms for High Performance Data-Transfer over InfiniBand P. Lai, H. Subramoni, S. Narravula, A. Mamidala, and DK Panda, International Conference on Parallel Processing (ICPP '09), Sep 2009 [Slides] [Bib - Plain]
183	RDMA over Ethernet - A Preliminary Study H. Subramoni, P. Lai, M. Luo, and DK Panda, International Workshop on High Performance Distributed Computing (HPI-DC '09), Sep 2009 [Slides] [Bib - Plain]
184	Designing Multi-Leader-Based Allgather Algorithms for Multi-Core Clusters K. Kandalla, H. Subramoni, G. Santhanaraman, and DK Panda, International Workshop on Communication Architecture for Clusters (CAC'09), May 2009 [Slides] [Bib - Plain]
185	Design and Evaluation of Benchmarks for Financial Applications using Advanced Message Queuing Protocol (AMQP) over InfiniBand H. Subramoni, G. Marsh, S. Narravula, P. Lai, and DK Panda, Workshop on High Performance Computational Finance (In conjunction with SC '08), Nov 2008 [OSU Technical Report Version (OSU-CISRC-10/08-TR51)] [Bib - Plain]
186	Performance of HPC middleware over InfiniBand WAN S. Narravula, H. Subramoni, P. Lai, R. Noronha, and DK Panda, International Conference on Parallel Processing 2008, Sep 2008 [Bib - Plain]
187	Designing Next Generation Clusters: Evaluation of InfiniBand DDR/QDR on Intel Computing Platforms H. Subramoni, M. Koop, and DK Panda, International Symposium on Hot Interconnects (HotI), Aug 2007 [Slides] [Bib - Plain]

Ph.D. Dissertations (1)
1	H. Subramoni, Topology-Aware MPI Communication and Scheduling for High Performance Computing Systems, Jul 2013

NOWLAB: Network Based Computing Lab

This page lists the publications by Hari Subramoni
