NOWLAB :: Publications

Book (1)
1	DK Panda, X. Lu, and D. Shankar, High-Performance Big Data Computing, The MIT Press, Aug 2022.

Journals (42)
1	L. Xu, K. Suresh, Q. Anthony, N. Alnaasan, and DK Panda, Characterizing Communication Patterns in Distributed Large Language Model Inference, IEEE Micro, Feb 2026.
2	Q. Anthony, B. Michalowicz, J. Hatef, L. Xu, M. Abduljabbar, A. Shafi, H. Subramoni, and DK Panda, Understanding and Characterizing Communication Characteristics for Distributed Transformer Models, IEEE Micro, Jan 2025.
3	T. Tran, G. Kuncham, B. Ramesh, S. Xu, H. Subramoni, and DK Panda, OHIO: Enhancing RDMA Scalability in Alltoall with Optimized Communication Overlap, IEEE Micro, Jan 2025.
4	T. Tran, B. Ramesh, B. Michalowicz, M. Abduljabbar, H. Subramoni, A. Shafi, and DK Panda, Accelerating Communication with Multi-HCA Aware Collectives in MPI, Concurrency and Computation: Practice and Experience (CCPE), July 2023,
5	K. Suresh, K. Khorassani, C. Chen, B. Ramesh, M. Abduljabbar, A. Shafi, H. Subramoni, and DK Panda, Network Assisted Non-Contiguous Transfers for GPU-Aware MPI Libraries, IEEE Micro, Jan 2023.
6	K. Khorassani, C. Chen, B. Ramesh, A. Shafi, H. Subramoni, and DK Panda, High Performance MPI over the Slingshot Interconnect, Special Issue of Journal of Computer Science and Technology (JCST), Feb 2023.
7	DK Panda, H. Subramoni, C. Chu, and M. Bayatpour, The MVAPICH project: Transforming Research into High-Performance MPI Library for HPC Community, Journal of Computational Science (JOCS), Special Issue on Translational Computer Science, Oct 2020.
8	J. Hashmi, C. Chu, S. Chakraborty, M. Bayatpour, H. Subramoni, and DK Panda, FALCON-X: Zero-copy MPI Derived Datatype Processing on Modern CPU and GPU Architectures, Journal of Parallel and Distributed Computing (JPDC), Volume 144, October 2020, Pages 1-13, doi.org/10.1016/j.jpdc.2020.05.008.
9	Ammar Awan, A. Jain, C. Chu, H. Subramoni, and DK Panda, Communication Profiling and Characterization of Deep Learning Workloads on Clusters with High-Performance Interconnects, IEEE Micro, vol. 40, no. 1, pp. 35-43, 1 Jan.-Feb. 2020.
10	A. Ruhela, H. Subramoni, S. Chakraborty, M. Bayatpour, P. Kousha, and DK Panda, Efficient Design for MPI Asynchronous Progress without Dedicated Resources, Parallel Computing - Systems & Applications, Volume 85, July 2019, Pages 13-26, https://doi.org/10.1016/j.parco.2019.03.003.
11	Ammar Awan, K. Vadambacheri Manian, C. Chu, H. Subramoni, and DK Panda, Optimized Large-Message Broadcast for Deep Learning Workloads: MPI, MPI+NCCL, or NCCL2?, Parallel Computing - Systems & Applications, Volume 85, July 2019, Pages 141-152, https://doi.org/10.1016/j.parco.2019.03.005.
12	C. Chu, X. Lu, Ammar Awan, H. Subramoni, Bracy Elton, and DK Panda, Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast, IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 30, no. 3, pp. 575-588, 1 March 2019.
13	S. Chakraborty, Ignacio Laguna, Murali Emani, Kathryn Mohror, DK Panda, Martin Schulz, and H. Subramoni, EReinit: Scalable and Efficient Fault Tolerance for Bulk-Synchronous MPI Applications, Concurrency and Computation: Practice and Experience, 14 August 2018, https://doi.org/10.1002/cpe.4863.
14	X. Lu, H. Shi, R. Biswas, M. H. Javed, and DK Panda, DLoBD: A Comprehensive Study of Deep Learning over Big Data Stacks on HPC Clusters, IEEE Transactions on Multi-Scale Computing Systems, Jun 2018.
15	S. Ramesh, A. Mahéo, S. Shende, A. Malony, H. Subramoni, A. Ruhela, and DK Panda, MPI performance engineering with the MPI tool interface: The integration of MVAPICH and TAU, Parallel Computing, ISSN 0167-8191, Volume 77, Sep 2018.
16	M. W. Rahman, N. Islam, X. Lu, D. Shankar, and DK Panda, MR-Advisor: A Comprehensive Tuning, Profiling, and Prediction Tool for MapReduce Execution Frameworks on HPC Clusters, Journal of Parallel and Distributed Computing (JPDC), Nov 2017.
17	X. Lu, D. Shankar, and DK Panda, Scalable and Distributed Key-Value Store-based Data Management Using RDMA-Memcached, IEEE Data Engineering Bulletin (DEBull), Volume 40, Bulletin of the Technical Committee on Data Engineering (TCDE) (Invited Paper), Mar 2017.
18	M. W. Rahman, N. Islam, X. Lu, and DK Panda, A Comprehensive Study of MapReduce over Lustre for Intermediate Data Placement and Shuffle Strategies on HPC Clusters, IEEE Transactions on Parallel and Distributed Systems, Jul 2016.
19	D. Shankar, X. Lu, M. W. Rahman, N. Islam, and DK Panda, Characterizing and benchmarking stand-alone Hadoop MapReduce on modern HPC clusters, The Journal of Supercomputing - Springer, Jun 2016.
20	K. Hamidouche, A. Venkatesh, Ammar Awan, H. Subramoni, and DK Panda, CUDA-Aware OpenSHMEM: Extensions and Designs for High Performance OpenSHMEM on GPU Clusters, ParCo: Elsevier Parallel Computing Journal.
21	H. Wang, S. Potluri, D. Bureddy, and DK Panda, GPU-Aware MPI on RDMA-Enabled Cluster: Design, Implementation and Evaluation, IEEE Transactions on Parallel & Distributed Systems, Vol. 25, No. 10, pp. 2595-2605, Oct 2014.
22	N. Islam, X. Lu, M. W. Rahman, J. Jose, and DK Panda, A Micro-Benchmark Suite for Evaluating HDFS Operations on Modern Clusters, Special Issue of LNCS on papers from WBDB '12 Workshop, May 2012.
23	S. Sur, S. Potluri, K. Kandalla, H. Subramoni, K. Tomko, and DK Panda, Co-Designing MPI Library and Applications for InfiniBand Clusters, IEEE Computer, Nov 2011.
24	P. Lai, P. Balaji, R. Thakur, and DK Panda, ProOnE: A General-Purpose Protocol Onload Engine for Multi- and Many-Core Architectures, Computer Science: Research and Development, Special Issue of Scientific Papers from ISC '09, Jun 2009.
25	A. Vishnu, M. Koop, A. Moody, A. Mamidala, S. Narravula, and DK Panda, Topology Agnostic Hot-Spot Avoidance with InfiniBand, Concurrency and Computation: Practice and Experience, Special Issue of Best Papers from CCGrid '07, Jan 2008.
26	H. Jin, P. Balaji, C. Yoo, J.-Y. Choi, and DK Panda, Exploiting NIC Architectural Support for Enhancing IP based Protocols on High Performance Networks, OSU-CISRC-5/04-TR37, Nov 2005.
27	J. Liu, A. Mamidala, A. Vishnu, and DK Panda, Performance Evaluation of InfiniBand with PCI Express, IEEE Micro, Jan 2005.
28	J. Liu, J. Wu, and DK Panda, High Performance RDMA-Based MPI Implementation over InfiniBand, Int'l Journal of Parallel Programming: Volume 32, Number 3, Jun 2004.
29	J. Liu, B. Chandrasekaran, W. Yu, J. Wu, D. Buntinas, S. Kini, P. Wyckoff, and DK Panda, Micro-Benchmark Performance Comparison of High-Speed Cluster Interconnects, IEEE Micro, Jan 2004.
30	A. Wagner, D. Buntinas, R. Brightwell, and DK Panda, Application-Bypass Reduction for Large-Scale Clusters, Int'l Journal of High Performance Computing and Networking, Cluster 2003 Special Issue, In Press, Dec 2003.
31	R. Sivaram, C. Stunkel, and DK Panda, HIPIQS: A High-Performance Switch Architecture using Input Queuing, IEEE Transactions on Parallel and Distributed Systems, Vol. 13, No. 3, pp. 275-289, Mar 2002.
32	M. Banikazemi, B. Abali, L. Herger, and DK Panda, Design Alternatives for Virtual Interface Architecture (VIA) and an Implementation on IBM Netfinity NT Cluster, Journal of Parallel and Distributed Computing, Special Issue on Clusters, Volume 61, Number 11, pp. 1512-1545, Nov 2001.
33	M. Banikazemi, R. K. Govindaraju, R. Blackmore, and DK Panda, MPI-LAPI: An Efficient Implementation of MPI for IBM RS/6000 SP Systems, IEEE Transactions on Parallel and Distributed Systems, Vol. 12, No. 10, pp. 1081-1093, Oct 2001.
34	B. Abali, C. B. Stunkel, J. Herring, M. Banikazemi, DK Panda, C. Aykanat, and Y. Aydogan, Adaptive Routing on the New Switch Chip for IBM SP Systems, Journal of Parallel and Distributed Computing, Special Issue on Routing in Computer and Communication Networks, Volume 61, Number 9, pp. 1148-1179, Sep 2001.
35	R. Kesavan, and DK Panda, Efficient Multicast on Irregular Switch-based Cut-Through Networks with Up-Down Routing, IEEE Transactions on Parallel and Distributed Systems, Vol. 12, No. 8, pp. 808-828, Aug 2001.
36	R. Sivaram, R. Kesavan, DK Panda, and C. Stunkel, Architectural Support for Efficient Multicasting in Irregular Networks, IEEE Transactions on Parallel and Distributed Systems, Vol. 12, No. 5, pp. 489-513, May 2001.
37	R. Sivaram, C. Stunkel, and DK Panda, Implementing Multidestination Worms in Switch-Based Parallel Systems: Architectural Alternatives and their Impact, IEEE Transactions on Parallel and Distributed Systems, Vol. 11, No. 8, pp. 794-812, Aug 2000.
38	R. Kesavan, and DK Panda, Multiple Multicast with Minimized Node Contention on Wormhole k-ary n-cube Networks, IEEE Transactions on Parallel and Distributed Systems, Vol. 10, No. 4, pp. 371-393, Apr 1999.
39	D. Dai, and DK Panda, Exploiting the Benefits of Multiple-Path Network in DSM Systems: Architectural Alternatives and Performance Evaluation, IEEE Transactions on Computers, Special Issue on Cache Memory, Vol. 48, No. 2, pp. 236-244, Feb 1999.
40	R. Prakash, and DK Panda, Designing Communication Strategies for Heterogeneous Parallel Systems, Parallel Computing, Volume 24, pp. 2035-2052, Dec 1998.
41	R. Sivaram, DK Panda, and C. B. Stunkel, Efficient Broadcast and Multicast on Multistage Interconnection Networks using Multiport Encoding, IEEE Transactions on Parallel and Distributed Systems, Vol. 9, No. 10, pp. 1004-1028, Oct 1998.
42	D. Basak, and DK Panda, Designing Clustered Multiprocessor Systems under Packaging and Technological Advancements, IEEE Transactions on Parallel and Distributed Systems, Vol. 7, No. 9, pp. 962-978, Sep 1996.

Book Chapter (3)
1	X. Lu, J. Zhang, and DK Panda, Building Efficient HPC Cloud with SR-IOV Enabled InfiniBand: The MVAPICH2 Approach, Book "Research Advances in Cloud Computing", edited by Sanjay Chaudhary, Gaurav Somani, and Rajkumar Buyya, Springer International Publishing, Aug 2017.
2	X. Lu, and DK Panda, Contribution on Multiple Chapters related to OpenStack, Virtualized HPC, HPC Network Fabric, and HPC Workload Management, Book "The Crossroads of Cloud and HPC: OpenStack for Scientific Research; Exploring OpenStack Cloud Computing for Scientific Workloads", edited by Stig Telfer, OpenStack Foundation Publishing (Invited Book Chapter), Nov 2016.
3	X. Lu, M. W. Rahman, N. Islam, D. Shankar, and DK Panda, Accelerating Big Data Processing on Modern HPC Clusters, Book "Conquering Big Data with High Performance Computing", edited by Ritu Arora, Springer International Publishing (Invited Book Chapter), Jul 2016.

Conferences & Workshops (550)
1	HAT-MPI: Hierarchical Auto Tuning of MPI Inter-Node Communication on InfiniBand Clusters, S. Lee, S. Tilford, and DK Panda, 4th Workshop on AI for Systems in conjunction with HPDC 2026, Jul 2026
2	Design and Implementation of Multi-Rail-Aware Hierarchical MPI Reduce-Scatter and Allgather Operations, C. Chen, J. Yao, and DK Panda, ISC High Performance 2026, Jun 2026
3	Understanding Buffer Allocation and Data Transfer Mechanisms on AMD MI300A APUs, G. Kuncham, S. Zhang, and DK Panda, ISC High Performance 2026, Jun 2026 [Research Poster]
4	NIMBLE: Node-Interconnect Multi-Path Balancing with On-the-fly Orchestration for High Bandwidth GPU Clusters, J. Yao, and DK Panda, ISC High Performance 2026, Jun 2026 [Research Poster]
5	Multi-Channel DMA-Accelerated MPI Intra-Node Communication: A Hybrid Adaptive Framework with Memory Copy Offloading, S. Xu, S. Lee, G. Kuncham, and DK Panda, ISC High Performance 2026, Jun 2026
6	Design and Implementation of Casting Compression for GPU-Aware MPI Collectives, C. Chen, N. Contini, L. Xu, J. Queiser, H. Subramoni, and DK Panda, 40th IEEE International Parallel & Distributed Processing Symposium, May 2026
7	From Skew to Symmetry: Node-Interconnect Multi-Path Balancing with Execution-time Planning for Modern GPU Clusters, J. Yao, K. Suresh, B. Ramesh, H. Subramoni, and DK Panda, 40th IEEE International Parallel & Distributed Processing Symposium, May 2026
8	One Memory-Many Paths: Early Experiences with Allocation and Data Copy Strategies on MI300A, G. Kuncham, S. Zhang, and DK Panda, 40th IEEE International Parallel & Distributed Processing Symposium, May 2026 [Best Paper Finalist]
9	MAC-Attention: a Match-Amend-Complete Scheme for Fast and Accurate Attention Computation, J. Yao, S. Jacobs, W. Krichene, M. Tanaka, and DK Panda, Ninth Annual Conference on Machine Learning and Systems, MLSys 26, May 2026
10	Design and Optimization of GPU-Aware MPI Allreduce Using Direct Sendrecv Communication, C. Chen, J. Yao, H. Subramoni, and DK Panda, NVIDIA GTC AI Conference 2026, Mar 2026 [Research Poster]
11	HyperSack: Resource-Aware Distributed Hyperparameter Optimization for Lightweight Vision and Language Models on NVIDIA GPU Systems, N. Alnaasan, and DK Panda, NVIDIA GTC AI Conference 2026, Mar 2026 [Research Poster]
12	Supporting Ultra-High-Resolution Digital Agriculture Tasks with Fully Synthetic Curriculum Learning, J. Hatef, Q. Anthony, N. Alnaasan, and DK Panda, The IEEE/CVF Winter Conference on Applications of Computer Vision, HARVEST-Vision 2026, Mar 2026
13	Performance Characterization of Data Transfer and Allocation Strategies on AMD MI300A APUs: Early Experiences, G. Kuncham, S. Zhang, B. Ramesh, K. Suresh, and DK Panda, 32nd IEEE International Conference on High Performance Computing, Data, & Analytics, Dec 2025 [Research Poster]
14	Enhanced MPI Intra-node Communication Framework: A Hybrid Approach with Cooperative DMA Channel-based Data Transfer, S. Xu, T. Tran, and DK Panda, 32nd IEEE International Conference on High Performance Computing, Data, & Analytics, Dec 2025 [Best Paper Finalist]
15	A Streaming Collectives Interface Targeting Dataflow Acceleration and HPC Workloads, N. Contini, J. Queiser, B. Ramesh, H. Subramoni, and DK Panda, The International Conference for High Performance Computing, Networking, Storage, and Analysis 2025, Nov 2025
16	OpenSHMEM MLIR: A Dialect for Compile-Time Optimization of One-Sided Communications, M. Beebe, B. Michalowicz, A. McNamara, Y. Kumar, DK Panda, Y. Chen, WK Poole, and S. Poole, The Eleventh Annual Workshop on the LLVM Compiler Infrastructure in HPC, Nov 2025
17	MPI Communication Performance on AMD MI300A: Microbenchmarks and Applications, G. Kuncham, S. Zhang, S. Mohammad, C. Chen, and DK Panda, Seventh Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, in conjunction with SC '25, Nov 2025
18	Design and Optimization of GPU-Aware MPI Allreduce Using Direct Sendrecv Communication, C. Chen, J. Yao, H. Subramoni, and DK Panda, 54th International Conference on Parallel Processing, Sep 2025
19	HARVEST Inference: Characterizing Digital Agriculture Workloads across Compute Continuum, T. Chen, Q. Anthony, and DK Panda, First International Workshop on Applications of HPC and AI in Agriculture (HARVEST), in conjunction with Int'l Conference on Parallel Processing (ICPP '25), Sep 2025
20	Towards Dynamic Message Passing Protocols for Stencil-Based Communication Patterns, K. Suresh, B. Ramesh, G. Kuncham, H. Subramoni, and DK Panda, IEEE International Conference on Cluster Computing 2025, Sep 2025 [Bharath and Kaushik are Co-Lead Authors]
21	Characterizing Communication Patterns in Distributed Large Language Model Inference, L. Xu, K. Suresh, Q. Anthony, N. Alnaasan, and DK Panda, IEEE Hot Interconnects Symposium 2025, Aug 2025
22	OMB-Compr: An Extension to OSU Micro Benchmarks for Collective Compression Error Measurement, J. Queiser, N. Contini, H. Subramoni, and DK Panda, Practice and Experience in Advanced Research Computing 2025, Jul 2025
23	OpenSHMEM Performance on Bluefield-3 Data Processing Units (DPUs), M. Beebe, B. Michalowicz, DK Panda, Y. Chen, WK. Poole, and S. Poole, Practice and Experience in Advanced Research Computing 2025, Jul 2025 [Short Paper] [Best Student Paper Award]
24	Use of BlueField-SmartNICs in Offloading One-Sided Communication Primitives, B. Michalowicz, K. Suresh, H. Subramoni, DK Panda, and S. Poole, ISC High Performance 2025, Jun 2025 [Research Poster]
25	Design and Implementation of MPI Collective Operations for Large Message Communication on AMD GPUs, C. Chen, L. Xu, H. Subramoni, and DK Panda, ISC High Performance 2025, Jun 2025 [Research Poster]
26	Design and Implementation of a GPU-Aware MPI Collective Library for Intel GPUs, C. Chen, G. Kuncham, H. Subramoni, and DK Panda, ISC High Performance 2025, Jun 2025 [Research Poster]
27	Unified Designs of Multi-rail-aware MPI Allreduce and Alltoall Operations Across Diverse GPU and Interconnect Systems, C. Chen, J. Yao, L. Xu, H. Subramoni, and DK Panda, 39th IEEE International Parallel & Distributed Processing Symposium, Jun 2025
28	Training ultra long context language model with fully pipelined distributed transformer, J. Yao, S. Jacobs, M. Tanaka, O. Ruwase, H. Subramoni, and DK Panda, The Eighth Annual Conference on Machine Learning and Systems, May 2025
29	Effective and Efficient Offloading Designs for One-Sided Communication to SmartNICs, B. Michalowicz, K. Suresh, H. Subramoni, M. Abduljabbar, DK Panda, and S. Poole, 31st IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2024
30	Using BlueField-3 SmartNICs to Offload Vector Operations in Krylov Subspace Methods, K. Suresh, B. Michalowicz, N. Contini, B. Ramesh, M. Abduljabbar, A. Shafi, H. Subramoni, and DK Panda, 31st IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2024
31	Design and Implementation of Kernel-based MPI Reduction Operations for Intel GPUs, C. Chen, G. Kuncham, H. Subramoni, and DK Panda, 31st IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2024
32	Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning, L. Xu, Q. Anthony, J. Hatef, A. Shafi, H. Subramoni, and DK Panda, 31st IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2024
33	HyperSack: Distributed Hyperparameter Optimization for Deep Learning using Resource-Aware Scheduling on Heterogeneous GPU Systems, N. Alnaasan, B. Ramesh, J. Yao, A. Shafi, H. Subramoni, and DK Panda, 31st IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2024
34	HARVEST-2.0: High-Performance Vision Framework for End-to-end Preprocessing, Training, Inference, and Visualization, N. Alnaasan, A. Potlapally, T. Chen, M. Lieber, A. Shafi, H. Subramoni, S. Shearer, and DK Panda, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'24), Nov 2024 [Research Poster]
35	Demystifying the Communication Characteristics for Distributed Transformer Models, Q. Anthony, B. Michalowicz, J. Hatef, L. Xu, M. Abduljabbar, A. Shafi, H. Subramoni, and DK Panda, IEEE Hot Interconnects Symposium 2024, Aug 2024 [Q. Anthony and B. Michalowicz are co-lead authors]
36	Characterizing Communication in Distributed Parameter-Efficient Fine-Tuning for Large Language Models, N. Alnaasan, H. Huang, A. Shafi, H. Subramoni, and DK Panda, IEEE Hot Interconnects Symposium 2024, Aug 2024
37	OHIO: Improving RDMA Network Scalability in MPI_Alltoall through Optimized Hierarchical and Intra/Inter-Node Communication Overlap Design, T. Tran, G. Kuncham, B. Ramesh, S. Xu, H. Subramoni, M. Abduljabbar, and DK Panda, IEEE Hot Interconnects Symposium 2024, Aug 2024
38	The Case for Co-Designing Model Architectures with Hardware, Q. Anthony, J. Hatef, Deepak Narayanan, Stella Biderman, Stas Bekman, Junqi Yin, A. Shafi, H. Subramoni, and DK Panda, 53rd International Conference on Parallel Processing, Aug 2024
39	Design and Implementation of an IPC-based Collective MPI Library for Intel GPUs, C. Chen, G. Kuncham, P. Kousha, H. Subramoni, and DK Panda, Practice and Experience in Advanced Research Computing, Jul 2024
40	OMB-CXL: A Micro-Benchmark Suite for Evaluating MPI Communication Utilizing Compute Express Link Memory Devices, T. Tran, M. Abduljabbar, Hooyoung Ahn, Seonyoung Kim, Yoomi Park, Woojong Han, Shinyoung Ahn, H. Subramoni, and DK Panda, Practice and Experience in Advanced Research Computing, Jul 2024 [July 21st to July 25th, 2024 in Providence, RI.]
41	A Novel LLM-enabled Framework for Accelerating the Creation of Knowledge Graphs for HPC, P. Kousha, V. Sathu, M. Han, J. Jani, N. Alnaasan, H. Subramoni, and DK Panda, Practice and Experience in Advanced Research Computing, Jul 2024
42	OMB-FPGA: A Microbenchmark Suite for FPGA-aware MPIs using OpenCL and SYCL, N. Contini, M. Abduljabbar, H. Subramoni, and DK Panda, Practice and Experience in Advanced Research Computing, Jul 2024
43	Infer-HiRes: Accelerating Inference for High-Resolution Images with Quantization and Distributed Deep Learning, R. Gulhane, Q. Anthony, H. Subramoni, and DK Panda, Practice and Experience in Advanced Research Computing, Jul 2024
44	PML-MPI: A Pre-Trained ML Framework for Efficient Collective Algorithm Selection in MPI, M. Han, G. Kuncham, B. Michalowicz, R. Vaidya, M. Abduljabbar, A. Shafi, H. Subramoni, and DK Panda, iWAPT '24 (IPDPSW), May 2024
45	Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference, J. Yao, Q. Anthony, A. Shafi, H. Subramoni, and DK Panda, 38th IEEE International Parallel & Distributed Processing Symposium, May 2024
46	HINT: Designing Cache-Efficient MPI_Alltoall using Hybrid Memory Copy Ordering and Non-Temporal Instructions, B. Ramesh, N. Contini, N. Alnaasan, K. Suresh, M. Abduljabbar, A. Shafi, H. Subramoni, and DK Panda, 38th IEEE International Parallel & Distributed Processing Symposium, May 2024
47	Accelerating MPI AllReduce Communication with Efficient GPU-Based Compression Schemes on Modern GPU Clusters, Q. Zhou, B. Ramesh, A. Shafi, M. Abduljabbar, H. Subramoni, and DK Panda, ISC High Performance 2024, May 2024
48	Profiling, Storing and Monitoring HPC Communication Data at Scale by OSU INAM, P. Kousha, H. Subramoni, DK Panda, M. Tatineni, and P. Mulrooney, ISC High Performance 2024, May 2024 [Research Poster]
49	High-Performance Semi-Supervised Learning with HARVEST: A Distributed Computer Vision Framework for Expert Labeling, N. Alnaasan, M. Lieber, A. Shafi, H. Subramoni, S. Shearer, and DK Panda, ISC High Performance 2024, May 2024 [Research Poster] [Best Poster Award]
50	Accelerating Large Language Model Training with Hybrid GPU-based Compression, L. Xu, Q. Anthony, Q. Zhou, N. Alnaasan, R. Gulhane, A. Shafi, H. Subramoni, and DK Panda, IEEE/ACM International Symposium on Cluster, Cloud, and Internet Computing 2024, May 2024
51	AccDP: Accelerated Data-Parallel Distributed DNN Training for Modern GPU-Based HPC Clusters, N. Alnaasan, A. Shafi, H. Subramoni, and DK Panda, NVIDIA GTC AI Conference 2024, Mar 2024 [Research Poster]
52	Optimized All-to-all Connection Establishment for High-Performance MPI Libraries over InfiniBand, S. Xu, G. Kuncham, M. Abduljabbar, H. Subramoni, and DK Panda, 30th IEEE International Conference on High Performance Computing, Data, & Analytics, Dec 2023
53	Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference, J. Yao, N. Alnaasan, T. Chen, A. Shafi, H. Subramoni, and DK Panda, 30th IEEE International Conference on High Performance Computing, Data, & Analytics, Dec 2023
54	MPI Allgather Utilizing CXL Shared Memory Pool in Multi-Node Computing Systems, Hooyoung Ahn, Seonyoung Kim, Yoomi Park, Woojong Han, Shinyoung Ahn, T. Tran, B. Ramesh, H. Subramoni, and DK Panda, IEEE International Conference on Big Data, Dec 2023 [Dec 15-18, 2024 @ Washington DC, USA]
55	HARVEST: High-Performance Artificial Vision Framework for Expert Labeling using Semi-Supervised Training, N. Alnaasan, M. Lieber, A. Shafi, H. Subramoni, S. Shearer, and DK Panda, 2023 IEEE International Conference on Big Data, Dec 2023
56	MPI4Spark Meets YARN: Enhancing MPI4Spark through YARN support for HPC, K. Al Attar, A. Shafi, H. Subramoni, and DK Panda, 11th International Workshop on Distributed Storage and Blockchain Technologies for Big Data (IEEE Big Data '23), Dec 2023
57	Benchmarking Modern Databases for Storing and Profiling Very Large Scale HPC Communication Data, P. Kousha, Q. Zhou, H. Subramoni, and DK Panda, The 15th BenchCouncil International Symposium On Benchmarking, Measuring And Optimizing, Dec 2023
58	MPI-xCCL: A Portable MPI Library over Collective Communication Libraries for Various Accelerators, C. Chen, K. Khorassani, P. Kousha, Q. Zhou, J. Yao, H. Subramoni, and DK Panda, Sixth Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, Nov 2023
59	Designing In-network Computing Aware Reduction Collectives in MPI, B. Ramesh, G. Kuncham, K. Suresh, R. Vaidya, N. Alnaasan, M. Abduljabbar, A. Shafi, and DK Panda, Hot Interconnects 2023, Aug 2023
60	Battle of the BlueFields: An In-Depth Comparison of the BlueField-2 and BlueField-3 SmartNICs, B. Michalowicz, K. Suresh, H. Subramoni, DK Panda, and S. Poole, Hot Interconnects 2023, Aug 2023
61	DPU-Bench: A Micro-Benchmark Suite to Measure Offload Efficiency Of SmartNICs, B. Michalowicz, K. Suresh, H. Subramoni, DK Panda, and S. Poole, Practice and Experience in Advanced Research Computing 23, Jul 2023
62	Optimizing Amber for Device-to-Device GPU Communication, S. Khuvis, K. Tomko, S. Brozell, C. Chen, H. Subramoni, and DK Panda, Practice and Experience in Advanced Research Computing 23, Jul 2023
63	Enabling Reconfigurable HPC through MPI-based Inter-FPGA Communication, N. Contini, B. Ramesh, K. Suresh, T. Tran, B. Michalowicz, M. Abduljabbar, H. Subramoni, and DK Panda, International Conference on Supercomputing 2023, Jun 2023
64	SAI: AI-Enabled Speech Assistant Interface for Science Gateways in HPC, P. Kousha, A. Jain, A. Kolli, M. Lieber, M. Han, N. Contini, H. Subramoni, and DK Panda, ISC High Performance 2023, May 2023
65	A Novel Framework for Efficient Offloading of Communication Operations to Bluefield SmartNICs, K. Suresh, B. Michalowicz, B. Ramesh, N. Contini, J. Yao, S. Xu, A. Shafi, H. Subramoni, and DK Panda, 37th IEEE International Parallel & Distributed Processing Symposium (IPDPS '23), May 2023
66	Accelerating Distributed Deep Learning Training with Compression Assisted Allgather and Reduce-Scatter Communication, Q. Zhou, Q. Anthony, L. Xu, A. Shafi, M. Abduljabbar, H. Subramoni, and DK Panda, 37th IEEE International Parallel & Distributed Processing Symposium (IPDPS '23), May 2023
67	MCR-DL: Mix-and-Match Communication Runtime for Deep Learning, Q. Anthony, Ammar Awan, J. Rasley, Y. He, A. Shafi, M. Abduljabbar, H. Subramoni, and DK Panda, 37th IEEE International Parallel & Distributed Processing Symposium (IPDPS '23), May 2023
68	Designing and Optimizing GPU-aware Nonblocking MPI Neighborhood Collective Communication for PETSc, K. Khorassani, C. Chen, H. Subramoni, and DK Panda, 37th IEEE International Parallel & Distributed Processing Symposium (IPDPS '23), May 2023
69	In-Depth Evaluation of a Lower-Level Direct-Verbs API on InfiniBand-based Clusters: Early Experiences, B. Michalowicz, K. Suresh, B. Ramesh, A. Shafi, H. Subramoni, M. Abduljabbar, and DK Panda, 25th Workshop on Advances in Parallel and Distributed Computational Models, May 2023 [Held in conjunction with IPDPS 2023]
70	Implementing and Optimizing a GPU-aware MPI Library for Intel GPUs: Early Experiences, C. Chen, K. Khorassani, G. Kuncham, R. Vaidya, M. Abduljabbar, A. Shafi, H. Subramoni, and DK Panda, The 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, May 2023
71	ScaMP: Scalable Meta-Parallelism for Deep Learning Search, Q. Anthony, L. Xu, A. Shafi, H. Subramoni, and DK Panda, The 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, May 2023
72	Performance Characterization of using Quantization for DNN Inference on Edge Devices, H. Ahn, T. Chen, N. Alnaasan, A. Shafi, M. Abduljabbar, H. Subramoni, and DK Panda, 7th IEEE International Conference on Fog and Edge Computing, May 2023
73	Efficient Personalized and Non-Personalized Alltoall Communication for Modern Multi-HCA GPU-Based Clusters, K. Suresh, A. Paniraja Guptha, B. Michalowicz, B. Ramesh, M. Abduljabbar, A. Shafi, H. Subramoni, and DK Panda, 29th IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2022
74	AccDP: Accelerated Data-Parallel Distributed DNN Training for Modern GPU-Based HPC Clusters, N. Alnaasan, A. Jain, A. Shafi, H. Subramoni, and DK Panda, 29th IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2022
75	Designing Efficient Pipelined Communication Schemes using Compression in MPI Libraries, B. Ramesh, Q. Zhou, A. Shafi, M. Abduljabbar, H. Subramoni, and DK Panda, 29th IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2022
76	Accelerating Broadcast Communication with GPU Compression for Deep Learning Workloads, Q. Zhou, Q. Anthony, A. Shafi, H. Subramoni, and DK Panda, 29th IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2022
77	Spark Meets MPI: Towards High-Performance Communication Framework for Spark using MPI, K. Al Attar, A. Shafi, M. Abduljabbar, H. Subramoni, and DK Panda, IEEE Cluster '22, Sep 2022
78	Designing Hierarchical Multi-HCA Aware Allgather in MPI, T. Tran, B. Michalowicz, B. Ramesh, H. Subramoni, A. Shafi, and DK Panda, Fifteenth International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2), 2022, Aug 2022 [To be held in conjunction with ICPP 2022: The 51st International Conference on Parallel Processing August 29th to Sept 1st, 2022 in Bordeaux, France]
79	Network-Assisted Non-Contiguous Transfers for GPU-Aware MPI Libraries, K. Suresh, K. Khorassani, C. Chen, B. Ramesh, M. Abduljabbar, A. Shafi, and DK Panda, Hot Interconnects 29, Aug 2022
80	High Performance MPI over the Slingshot Interconnect: Early Experiences, K. Khorassani, C. Chen, B. Ramesh, A. Shafi, H. Subramoni, and DK Panda, Practice and Experience in Advanced Research Computing, Jul 2022 [Best Student Paper Award]
81	Arm meets Cloud: A Case Study of MPI Library Performance on AWS Arm-based HPC Cloud with Elastic Fabric Adapter, S. Xu, A. Shafi, H. Subramoni, and DK Panda, 24th Workshop on Advances in Parallel and Distributed Computational Models, May 2022
82	Towards Java-based HPC using the MVAPICH2 Library: Early Experiences, K. Al Attar, A. Shafi, H. Subramoni, and DK Panda, HIPS '22 (IPDPSW), May 2022
83	Highly Efficient Alltoall and Alltoallv Communication Algorithms for GPU Systems, C. Chen, K. Khorassani, Q. Anthony, A. Shafi, H. Subramoni, and DK Panda, Heterogeneity in Computing Workshop (HCW 2022), May 2022 [held in conjunction with IPDPS'22]
84	OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries on HPC Systems, N. Alnaasan, A. Jain, A. Shafi, H. Subramoni, and DK Panda, 23rd Parallel and Distributed Scientific and Engineering Computing Workshop (PDSEC) at IPDPS22, May 2022
85	Hy-Fi: Hybrid Five-Dimensional Parallel DNN Training on High-Performance GPU Clusters, A. Jain, A. Shafi, Q. Anthony, P. Kousha, H. Subramoni, and DK Panda, ISC High Performance 2022, May 2022
86	Accelerating MPI All-to-All Communication with Online Compression on Modern GPU Clusters, Q. Zhou, P. Kousha, Q. Anthony, K. Khorassani, A. Shafi, H. Subramoni, and DK Panda, ISC High Performance 2022, May 2022
87	"Hey CAI" - Enhancing User Productivity through a Conversational AI Enabled User Interface for HPC Tools, P. Kousha, A. Jain, A. Kolli, S. Prasanna, S. Miriyala, H. Subramoni, A. Shafi, and DK Panda, ISC High Performance 2022, May 2022
88	OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries and Machine Learning Applications on HPC Systems, N. Alnaasan, A. Jain, A. Shafi, H. Subramoni, and DK Panda, ISC High Performance 2022, May 2022 [Research Poster] [Best Poster Award]
89	Layout aware Hardware assisted Designs for Derived Data Types in MPI, K. Suresh, B. Ramesh, C. Chen, M. Ghazimirsaeed, M. Bayatpour, A. Shafi, H. Subramoni, and DK Panda, 28th IEEE International Conference on High Performance Computing, Data, Analytics, and Data Science, Dec 2021
90	Towards Architecture-aware Hierarchical Communication Trees on Modern HPC Systems, B. Ramesh, J. Hashmi, S. Xu, A. Shafi, M. Ghazimirsaeed, M. Bayatpour, H. Subramoni, and DK Panda, 28th IEEE International Conference on High Performance Computing, Data, Analytics, and Data Science, Dec 2021 [Best Paper Finalist]
91	Large-Message Nonblocking MPI_Iallgather and MPI_Ibcast Offload via BlueField-2 DPU, N. Sarkauskas, M. Bayatpour, T. Tran, B. Ramesh, H. Subramoni, and DK Panda, 28th IEEE International Conference on High Performance Computing, Data, Analytics, and Data Science, Dec 2021 [Short Paper]
92	Accelerating CPU-based Distributed DNN Training on Modern HPC Clusters using BlueField-2 DPUs, A. Jain, N. Alnaasan, A. Shafi, H. Subramoni, and DK Panda, 28th IEEE Hot Interconnects, Aug 2021
93	INAM: Cross-stack Profiling and Analysis of Communication in MPI-based Applications P. Kousha, K. Raj, M. Kedia, H. Subramoni, A. Jain, A. Shafi, DK Panda, Trey Dockendorf, Heechang Na, and K. Tomko, Practice and Experience in Advanced Research Computing 2021, Jul 2021 [Bib - Plain]
94	BluesMPI: Efficient MPI Non-blocking Alltoall Offloading Designs on Modern BlueField Smart NICs M. Bayatpour, N. Sarkauskas, H. Subramoni, J. Hashmi, and DK Panda, ISC HIGH PERFORMANCE 2021, Jun 2021 [Bib - Plain]
95	Designing a ROCm-aware MPI Library for AMD GPUs: Early Experiences K. Khorassani, J. Hashmi, C. Chu, C. Chen, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2021, Jun 2021 [Bib - Plain]
96	Scaling Single-Image Super-Resolution Training on Modern HPC Clusters: Early Experiences Q. Anthony, L. Xu, H. Subramoni, and DK Panda, Scalable Deep Learning over Parallel And Distributed Infrastructures, May 2021 [Bib - Plain]
97	SUPER: SUb-Graph Parallelism for TransformERs A. Jain, T. Moon, T. Benson, H. Subramoni, S. Jacobs, DK Panda, and B. Essen, 35th IEEE International Parallel & Distributed Processing Symposium, May 2021 [Bib - Plain]
98	Designing High-Performance MPI Libraries with On-the-fly Compression for Modern GPU Clusters Q. Zhou, C. Chu, N. Senthil Kumar, P. Kousha, M. Ghazimirsaeed, H. Subramoni, and DK Panda, 35th IEEE International Parallel & Distributed Processing Symposium, May 2021 [Best Paper Finalist] [Bib - Plain]
99	Adaptive and Hierarchical Large Message All-to-all Communication Algorithms for Large-scale Dense GPU Systems K. Khorassani, C. Chu, Q. Anthony, H. Subramoni, and DK Panda, The 21st IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, May 2021 [Bib - Plain]
100	Efficient MPI-based Communication for GPU-Accelerated Dask Applications A. Shafi, J. Hashmi, H. Subramoni, and DK Panda, The 21st IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, May 2021 [Bib - Plain]
101	Blink: Towards Efficient RDMA-based Communication Coroutines for Parallel Python Applications A. Shafi, J. Hashmi, H. Subramoni, and DK Panda, 27th IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2020 [Bib - Plain]
102	A Hierarchical and Load-Aware Design for Large Message Neighborhood Collectives M. Ghazimirsaeed, Q. Zhou, A. Ruhela, M. Bayatpour, H. Subramoni, and DK Panda, SC 2020, Nov 2020 [Bib - Plain]
103	GEMS: GPU Enabled Memory Aware Model Parallelism System for Distributed DNN Training A. Jain, Ammar Awan, A. Aljuhani, J. Hashmi, Q. Anthony, H. Subramoni, DK Panda, R. Machiraju, and A. Parwani, SC 2020, Nov 2020 [Bib - Plain]
104	Scalable MPI Collectives using SHARP: Large Scale Performance Evaluation on the TACC Frontera System B. Ramesh, K. Suresh, N. Sarkauskas, M. Bayatpour, J. Hashmi, H. Subramoni, and DK Panda, ExaMPI2020 - Workshop on Exascale MPI 2020, Nov 2020 [Bib - Plain]
105	MPI Meets Cloud: Case Study with Amazon EC2 and Microsoft Azure S. Xu, M. Ghazimirsaeed, J. Hashmi, H. Subramoni, and DK Panda, 4th Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, Nov 2020 [Bib - Plain]
106	Exploring Hybrid MPI+Kokkos Tasks Programming Model Samuel Khuvis, K. Tomko, J. Hashmi, and DK Panda, The 3rd Annual Parallel Applications Workshop, Alternatives to MPI+X (PAW-ATM), Nov 2020 [held in conjunction with SC’20] [Bib - Plain]
107	Design and Characterization of Infiniband Hardware Tag Matching in MPI M. Bayatpour, M. Ghazimirsaeed, S. Xu, H. Subramoni, and DK Panda, The 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, Nov 2020 [Bib - Plain]
108	Accelerating GPU-based Machine Learning in Python using MPI Library: A Case Study with MVAPICH2-GDR M. Ghazimirsaeed, Q. Anthony, A. Shafi, H. Subramoni, and DK Panda, 6th Workshop on Machine Learning in HPC Environments, Nov 2020 [Bib - Plain]
109	Dynamic Kernel Fusion for Bulk Non-contiguous Data Transfer on GPU Clusters C. Chu, K. Khorassani, Q. Zhou, H. Subramoni, and DK Panda, 22nd IEEE International Conference on Cluster Computing (IEEE Cluster 2020), Sep 2020 [Bib - Plain]
110	Accelerated Real-time Network Monitoring and Profiling at Scale using OSU INAM P. Kousha, K. Raj, H. Subramoni, DK Panda, H. Na, T. Dockendorf, and K. Tomko, Practice and Experience in Advanced Research Computing 2020, Jul 2020 [Bib - Plain]
111	NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems C. Chu, P. Kousha, Ammar Awan, K. Khorassani, H. Subramoni, and DK Panda, The 34th ACM International Conference on Supercomputing (ICS-2020), Jun 2020 [Bib - Plain]
112	Communication-Aware Hardware-Assisted MPI Overlap Engine M. Bayatpour, J. Hashmi, S. Chakraborty, K. Suresh, M. Ghazimirsaeed, B. Ramesh, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2020, Jun 2020 [Bib - Plain]
113	HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training with TensorFlow Ammar Awan, A. Jain, Q. Anthony, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2020, Jun 2020 [Bib - Plain]
114	OSU INAM: Profiling and Visualization Tool for Scalable and In-Depth Analysis of High-Performance GPU-enabled HPC Clusters P. Kousha, K. Raj, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2020, Jun 2020 [Best Poster Award] [Bib - Plain]
115	Efficient Training of Semantic Image Segmentation on Summit using Horovod and MVAPICH2-GDR Q. Anthony, Ammar Awan, A. Jain, H. Subramoni, and DK Panda, Scalable Deep Learning over Parallel and Distributed Infrastructures (ScaDL) at IPDPS '20, May 2020 [Bib - Plain]
116	Machine-agnostic and Communication-aware Designs for MPI on Emerging Architectures J. Hashmi, S. Xu, B. Ramesh, M. Bayatpour, H. Subramoni, and DK Panda, 34th IEEE International Parallel & Distributed Processing Symposium (IPDPS '20), May 2020 [Bib - Plain]
117	Performance Characterization of Network Mechanisms for Non-Contiguous Data Transfers in MPI K. Suresh, B. Ramesh, M. Ghazimirsaeed, M. Bayatpour, J. Hashmi, H. Subramoni, and DK Panda, Workshop on Scalable Networks for Advanced Computing Systems (SNACS) at IPDPS '20, May 2020 [Bib - Plain]
118	Analyzing and Understanding the Impact of Interconnect Performance on HPC, Big Data, and Deep Learning Applications: A Case Study with InfiniBand EDR and HDR A. Ruhela, S. Xu, K. Vadambacheri Manian, H. Subramoni, and DK Panda, Workshop on Scalable Networks for Advanced Computing Systems (SNACS) at IPDPS '20, May 2020 [Bib - Plain]
119	High-Performance Adaptive MPI Derived Datatype Communication for Modern Multi-GPU Systems C. Chu, J. Hashmi, K. Khorassani, H. Subramoni, and DK Panda, 26th IEEE International Conference on High Performance Computing, Data, Analytics and Data Science (HiPC '19), Dec 2019 [Bib - Plain]
120	Designing a Profiling and Visualization Tool for Scalable and In-Depth Analysis of High-Performance GPU Clusters P. Kousha, B. Ramesh, K. Suresh, C. Chu, A. Jain, N. Sarkauskas, H. Subramoni, and DK Panda, 26th IEEE International Conference on High Performance Computing, Data, Analytics and Data Science (HiPC '19), Dec 2019 [Bib - Plain]
121	SIMD-KV: Accelerating End-to-End Performance in Key-Value Stores with SIMD and RDMA over Emerging CPU Architectures D. Shankar, X. Lu, and DK Panda, 26th IEEE International Conference on High Performance Computing, Data, Analytics and Data Science (HiPC '19), Dec 2019 [Bib - Plain]
122	Design and Evaluation of Shared Memory Communication Benchmarks on Emerging Architectures using MVAPICH2 S. Xu, J. Hashmi, S. Chakraborty, H. Subramoni, and DK Panda, Third Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, Nov 2019 [Bib - Plain]
123	Leveraging Network-level parallelism with Multiple Process-Endpoints for MPI Broadcast A. Ruhela, B. Ramesh, S. Chakraborty, H. Subramoni, J. Hashmi, and DK Panda, Third Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, Nov 2019 [Bib - Plain]
124	OMB-UM: Design, Implementation, and Evaluation of CUDA Unified Memory Aware MPI Benchmarks K. Vadambacheri Manian, C. Chu, Ammar Awan, K. Khorassani, H. Subramoni, and DK Panda, 10th International Workshop in Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, Nov 2019 [Bib - Plain]
125	Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera A. Jain, Ammar Awan, H. Subramoni, and DK Panda, 3rd Deep Learning on Supercomputers Workshop (DLS) at SC19, Nov 2019 [Bib - Plain]
126	SimdHT-Bench: Characterizing SIMD-Aware Hash Table Designs on Emerging CPU Architectures D. Shankar, X. Lu, and DK Panda, 2019 IEEE International Symposium on Workload Characterization, Nov 2019 [Best Paper Finalist] [Bib - Plain]
127	Performance Characterization of DNN Training using TensorFlow and PyTorch on Modern Clusters A. Jain, Ammar Awan, Q. Anthony, H. Subramoni, and DK Panda, 21st IEEE International Conference on Cluster Computing, Sep 2019 [Bib - Plain]
128	Communication Profiling and Characterization of Deep-Learning Workloads on Clusters With High-Performance Interconnects Ammar Awan, A. Jain, C. Chu, H. Subramoni, and DK Panda, 26th Symposium on High-Performance Interconnects (HotI '19), Aug 2019 [Bib - Plain]
129	Designing Scalable and High-performance MPI Libraries on Amazon Elastic Fabric Adapter S. Chakraborty, S. Xu, H. Subramoni, and DK Panda, Hot Interconnects 26, Aug 2019 [Bib - Plain]
130	Performance Evaluation of MPI Libraries on GPU-enabled OpenPOWER Architectures: Early Experiences K. Khorassani, C. Chu, H. Subramoni, and DK Panda, International Workshop on OpenPOWER for HPC, held in conjunction with ISC'19, Jun 2019 [Bib - Plain]
131	Reduction Operations on Modern Supercomputers: Challenges and Solutions M. Bayatpour, J. Hashmi, S. Chakraborty, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2019, Jun 2019 [Best Poster Award] [Bib - Plain]
132	FALCON: Efficient Designs for Zero-copy MPI Datatype Processing on Emerging Architectures J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and DK Panda, 33rd IEEE International Parallel & Distributed Processing Symposium (IPDPS '19), May 2019 [Best Paper Finalist] [Bib - Plain]
133	C-GDR: High-Performance Container-aware GPUDirect MPI Communication Schemes on RDMA Networks J. Zhang, X. Lu, C. Chu, and DK Panda, 33rd IEEE International Parallel & Distributed Processing Symposium (IPDPS '19), May 2019 [Bib - Plain]
134	Design and Characterization of Shared Address Space MPI Collectives on Modern Architectures J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and DK Panda, The 19th Annual IEEE/ACM International Symposium in Cluster, Cloud, and Grid Computing (CCGRID 2019), May 2019 [Bib - Plain]
135	Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation Ammar Awan, J. Bedorf, C. Chu, H. Subramoni, and DK Panda, The 19th Annual IEEE/ACM International Symposium in Cluster, Cloud, and Grid Computing (CCGRID 2019), May 2019 [Bib - Plain]
136	Characterizing CUDA Unified Memory (UM)-Aware MPI Designs on Modern GPU Architectures K. Vadambacheri Manian, Ammar Awan, A. Ruhela, C. Chu, and DK Panda, 12th Workshop on General Purpose Processing Using GPU (GPGPU 2019) @ ASPLOS 2019, Apr 2019 [Bib - Plain]
137	Analyzing, Modeling, and Provisioning QoS for NVMe SSDs S. Gugnani, X. Lu, and DK Panda, 11th IEEE/ACM International Conference on Utility and Cloud Computing, Dec 2018 [Bib - Plain]
138	OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training Ammar Awan, C. Chu, H. Subramoni, X. Lu, and DK Panda, 25th IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2018 [Bib - Plain]
139	Accelerating TensorFlow with Adaptive RDMA-based gRPC R. Biswas, X. Lu, and DK Panda, 25th IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2018 [Bib - Plain]
140	Spark-uDAPL: Cost-Saving Big Data Analytics on Microsoft Azure Cloud with RDMA Networks X. Lu, D. Shankar, H. Shi, and DK Panda, 2018 IEEE International Conference on Big Data, Dec 2018 [Short Paper] [Bib - Plain]
141	EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures H. Shi, X. Lu, and DK Panda, 2018 International Symposium on Benchmarking, Measuring and Optimizing, Dec 2018 [Best Paper Award] [Bib - Plain]
142	Cooperative Rendezvous Protocols for Improved Performance and Overlap S. Chakraborty, M. Bayatpour, J. Hashmi, H. Subramoni, and DK Panda, The International Conference for High Performance Computing, Networking, Storage, and Analysis 2018, Nov 2018 [Best Student Paper Finalist] [Bib - Plain]
143	High-Performance Multi-Rail Erasure Coding Library over Modern Data Center Architectures: Early Experiences H. Shi, X. Lu, D. Shankar, and DK Panda, ACM Symposium on Cloud Computing (SoCC) 2018, Oct 2018 [Poster Paper] [Bib - Plain]
144	Efficient Asynchronous Communication Progress for MPI without Dedicated Resources A. Ruhela, H. Subramoni, S. Chakraborty, M. Bayatpour, P. Kousha, and DK Panda, The EuroMPI 2018 Conference, Sep 2018 [Bib - Plain]
145	Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL? Ammar Awan, C. Chu, H. Subramoni, and DK Panda, The EuroMPI 2018 Conference, Sep 2018 [Bib - Plain]
146	Multi-Threading and Lock-Free MPI RMA Based Graph Processing on KNL and POWER Architectures M. Li, X. Lu, H. Subramoni, and DK Panda, The EuroMPI 2018 Conference, Sep 2018 [Bib - Plain]
147	SALaR: Scalable and Adaptive Designs for Large Message Reduction Collectives M. Bayatpour, J. Hashmi, S. Chakraborty, H. Subramoni, P. Kousha, and DK Panda, IEEE Cluster 2018, Sep 2018 [Best Paper Award] [Bib - Plain]
148	Cutting the Tail: Designing High Performance Message Brokers to Reduce Tail Latencies in Stream Processing M. H. Javed, X. Lu, and DK Panda, IEEE Cluster 2018, Sep 2018 [Bib - Plain]
149	Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and DK Panda, 32nd IEEE International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018 [Bib - Plain]
150	Designing a Micro-Benchmark Suite to Evaluate gRPC for TensorFlow: Early Experiences R. Biswas, X. Lu, and DK Panda, The Ninth Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware, Mar 2018 [Bib - Plain]
151	Kernel-assisted Communication Engine for MPI on Emerging Manycore Processors J. Hashmi, K. Hamidouche, H. Subramoni, and DK Panda, 24th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC'17), Dec 2017 [Bib - Plain]
152	Designing Registration Caching Free High-Performance MPI Library with Implicit On-Demand Paging (ODP) of InfiniBand M. Li, X. Lu, H. Subramoni, and DK Panda, 24th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC'17), Dec 2017 [Bib - Plain]
153	MPI-LiFE: Designing High-Performance Linear Fascicle Evaluation of Brain Connectome with MPI S. Gugnani, X. Lu, F. Pestilli, C.F. Caiafa, and DK Panda, 24th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC'17), Dec 2017 [Bib - Plain]
154	Characterizing and Accelerating Indexing Techniques on Distributed Ordered Tables S. Gugnani, X. Lu, H. Qi, L. Zha, and DK Panda, 2017 IEEE International Conference on Big Data (IEEE Big Data 2017), Dec 2017 [Bib - Plain]
155	Performance Characterization and Acceleration of Big Data Workloads on OpenPOWER System X. Lu, H. Shi, D. Shankar, and DK Panda, 2017 IEEE International Conference on Big Data (IEEE Big Data 2017), Dec 2017 [Bib - Plain]
156	NVMD: Non-Volatile Memory Assisted Design for Accelerating MapReduce and DAG Execution Frameworks on HPC Systems M. W. Rahman, N. Islam, X. Lu, and DK Panda, 2017 IEEE International Conference on Big Data (IEEE Big Data 2017), Dec 2017 [Short Paper] [Bib - Plain]
157	Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds? J. Zhang, X. Lu, and DK Panda, 10th IEEE/ACM International Conference on Utility and Cloud Computing, Dec 2017 [Best Student Paper Award] [Bib - Plain]
158	Characterization of Big Data Stream Processing Pipeline: A Case Study using Flink and Kafka M. H. Javed, X. Lu, and DK Panda, 4th IEEE/ACM International Conference on Big Data Computing, Applications and Technologies, Dec 2017 [Bib - Plain]
159	An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures Ammar Awan, H. Subramoni, and DK Panda, 3rd Workshop on Machine Learning in High Performance Computing Environments, held in conjunction with SC17, Nov 2017 [Bib - Plain]
160	Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and DK Panda, SuperComputing 2017, Nov 2017 [Bib - Plain]
161	Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X J. Hashmi, M. Li, H. Subramoni, and DK Panda, Intel Xeon Phi User's Group (IXPUG) 2017, Sep 2017 [Bib - Plain]
162	Advancing MPI Libraries to the Many-core Era: Designs and Evaluations with MVAPICH2 S. Chakraborty, M. Bayatpour, H. Subramoni, and DK Panda, Intel Xeon Phi User's Group (IXPUG) 2017, Sep 2017 [Bib - Plain]
163	MPI Performance Engineering with the MPI Tool Interface: the Integration of MVAPICH and TAU DK Panda, 24th European MPI Users' Group Meeting, Sep 2017 [Best Paper] [Bib - Plain]
164	Contention Aware Kernel-Assisted MPI Collectives for Multi/Many-core Systems S. Chakraborty, H. Subramoni, and DK Panda, 2017 IEEE International Conference on Cluster Computing, Sep 2017 [Best Paper Finalist] [Bib - Plain]
165	Characterizing Deep Learning over Big Data (DLoBD) Stacks on RDMA-capable Networks X. Lu, H. Shi, M. H. Javed, R. Biswas, and DK Panda, The 25th Annual Symposium on High-Performance Interconnects (HotI), Aug 2017 [Bib - Plain]
166	Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning C. Chu, X. Lu, Ammar Awan, H. Subramoni, J. Hashmi, Bracy Elton, and DK Panda, ICPP 2017: International Conference on Parallel Processing, Aug 2017 [Bib - Plain]
167	MPI-GDS: High Performance MPI Designs with GPUDirect-aSync for CPU-GPU Control Flow Decoupling A. Venkatesh, C. Chu, K. Hamidouche, S. Potluri, Davide Rossetti, and DK Panda, ICPP 2017: International Conference on Parallel Processing, Aug 2017 [Bib - Plain]
168	Exploiting and Evaluating OpenSHMEM on KNL Architecture J. Hashmi, M. Li, H. Subramoni, and DK Panda, Fourth Workshop on OpenSHMEM and Related Technologies, Aug 2017 [Bib - Plain]
169	Designing Dynamic and Adaptive MPI Point-to-point Communication Protocols for Efficient Overlap of Computation and Communication H. Subramoni, S. Chakraborty, and DK Panda, International Supercomputing Conference (ISC ’17), Jun 2017 [Hans Meuer Award (Most Outstanding Research Paper)] [Bib - Plain]
170	High-Performance and Resilient Key-Value Store with Online Erasure Coding for Big Data Workloads D. Shankar, X. Lu, and DK Panda, 37th IEEE International Conference on Distributed Computing Systems (ICDCS 2017), Jun 2017 [Bib - Plain]
171	High-Performance Virtual Machine Migration Framework for MPI Applications on SR-IOV enabled InfiniBand Clusters J. Zhang, X. Lu, and DK Panda, 31st IEEE International Parallel & Distributed Processing Symposium (IPDPS '17), May 2017 [Bib - Plain]
172	Swift-X: Accelerating OpenStack Swift with RDMA for Building an Efficient HPC Cloud S. Gugnani, X. Lu, and DK Panda, 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '17), May 2017 [Bib - Plain]
173	Benchmarking Kudu Distributed Storage Engine on High-Performance Interconnects and Storage Devices N. Islam, M. W. Rahman, X. Lu, and DK Panda, The 8th Workshop on Big Data Benchmarks, Performance, Optimization, and Emerging Hardware (BPOE-8), Apr 2017 [Bib - Plain]
174	Designing Locality and NUMA Aware MPI Runtime for Nested Virtualization based HPC Cloud with SR-IOV Enabled InfiniBand J. Zhang, X. Lu, and DK Panda, 13th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '17), Apr 2017 [Bib - Plain]
175	NRCIO: NVM-aware RDMA-based Communication and I/O Schemes for Big Data Analytics X. Lu, N. Islam, M. W. Rahman, and DK Panda, The 8th Annual Non-Volatile Memories Workshop (NVMW '17), Mar 2017 [Bib - Plain]
176	S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters Ammar Awan, K. Hamidouche, J. Hashmi, and DK Panda, 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Feb 2017 [Slides] [Bib - Plain]
177	Mizan-RMA: Accelerating Mizan Graph Processing Framework with MPI RMA M. Li, X. Lu, K. Hamidouche, J. Zhang, and DK Panda, 23rd IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2016 [Bib - Plain]
178	CUDA M3: Designing Efficient CUDA Managed Memory-aware MPI by Exploiting GDR and IPC K. Hamidouche, Ammar Awan, A. Venkatesh, and DK Panda, 23rd IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2016 [Bib - Plain]
179	Re-designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters D. Banerjee, K. Hamidouche, and DK Panda, 8th IEEE International Conference on Cloud Computing Technology and Science (IEEE CloudCom '16), Dec 2016 [Bib - Plain]
180	Designing Virtualization-aware and Automatic Topology Detection Schemes for Accelerating Hadoop on SR-IOV-enabled Clouds S. Gugnani, X. Lu, and DK Panda, 8th IEEE International Conference on Cloud Computing Technology and Science (IEEE CloudCom '16), Dec 2016 [Bib - Plain]
181	Impact of HPC Cloud Networking Technologies on Accelerating Hadoop RPC and HBase X. Lu, D. Shankar, S. Gugnani, H. Subramoni, and DK Panda, 8th IEEE International Conference on Cloud Computing Technology and Science (IEEE CloudCom '16), Dec 2016 [Bib - Plain]
182	Enabling Performance Efficient Runtime Support for Hybrid MPI+UPC++ Programming Models J. Hashmi, K. Hamidouche, and DK Panda, 18th IEEE International Conference on High Performance Computing and Communications (HPCC'16), Dec 2016 [Bib - Plain]
183	Performance Characterization of Hadoop Workloads on SR-IOV-enabled Virtualized InfiniBand Clusters S. Gugnani, X. Lu, and DK Panda, 3rd IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (BDCAT'16), Dec 2016 [Bib - Plain]
184	Efficient Data Access Strategies for Hadoop and Spark on HPC Cluster with Heterogeneous Storage N. Islam, M. W. Rahman, X. Lu, and DK Panda, 2016 IEEE International Conference on Big Data, Dec 2016 [Bib - Plain]
185	High-Performance Design of Apache Spark with RDMA and Its Benefits on Various Workloads X. Lu, D. Shankar, S. Gugnani, and DK Panda, 2016 IEEE International Conference on Big Data, Dec 2016 [Bib - Plain]
186	Boldio: A Hybrid and Resilient Burst-Buffer Over Lustre for Accelerating Big Data I/O D. Shankar, X. Lu, and DK Panda, 2016 IEEE International Conference on Big Data, Dec 2016 [Short Paper] [Bib - Plain]
187	Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications C. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and DK Panda, First Workshop on Optimization of Communication in HPC runtime systems (COMHPC, SC Workshop), Nov 2016 [Bib - Plain]
188	OpenSHMEM NonBlocking Data Movement Operations with MVAPICH2-X: Early Experiences K. Hamidouche, J. Zhang, K. Tomko, and DK Panda, PGAS Applications Workshop, Nov 2016 [Bib - Plain]
189	Can Non-Volatile Memory Benefit MapReduce Applications on HPC Clusters? M. W. Rahman, N. Islam, X. Lu, and DK Panda, First Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems (PDSW-DISCS, SC Workshop), Nov 2016 [Bib - Plain]
190	Designing MPI Library with On-Demand Paging (ODP) of InfiniBand: Challenges and Benefits M. Li, K. Hamidouche, X. Lu, H. Subramoni, J. Zhang, and DK Panda, SuperComputing 2016, Nov 2016 [Bib - Plain]
191	Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters C. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and DK Panda, 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'16), Oct 2016 [Bib - Plain]
192	MR-Advisor: A Comprehensive Tuning Tool for Advising HPC Users to Accelerate MapReduce Applications on Supercomputers M. W. Rahman, N. Islam, X. Lu, D. Shankar, and DK Panda, 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'16), Oct 2016 [Bib - Plain]
193	Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning Ammar Awan, K. Hamidouche, A. Venkatesh, and DK Panda, The 23rd European MPI Users' Group Meeting (EuroMPI 16), Sep 2016 [Best Paper Runner-Up] [Bib - Plain]
194	Adaptive and Dynamic Design for MPI Tag Matching M. Bayatpour, H. Subramoni, S. Chakraborty, and DK Panda, IEEE Cluster 2016, Sep 2016 [Best Paper Nominee] [Bib - Plain]
195	SLURM-V: Extending SLURM for Building Efficient HPC Cloud with SR-IOV and IVShmem J. Zhang, X. Lu, S. Chakraborty, and DK Panda, 22nd International European Conference on Parallel and Distributed Computing (Euro-Par '16), Aug 2016 [Bib - Plain]
196	High Performance MPI Library for Container-based HPC Cloud on InfiniBand Clusters J. Zhang, X. Lu, and DK Panda, The 45th International Conference on Parallel Processing (ICPP '16), Aug 2016 [Bib - Plain]
197	Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC Comet M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and DK Panda, The 5th Annual Conference on Extreme Science and Engineering Discovery Environment (XSEDE), Jul 2016 [Bib - Plain]
198	INAM^2: InfiniBand Network Analysis & Monitoring with MPI H. Subramoni, A. Augustine, M. Arnold, J. Perkins, X. Lu, K. Hamidouche, and DK Panda, International Supercomputing Conference, Jun 2016 [Slides] [Bib - Plain]
199	High Performance Design for HDFS with Byte-Addressability of NVM and RDMA N. Islam, M. W. Rahman, X. Lu, and DK Panda, 30th International Conference on Supercomputing (ICS '16), Jun 2016 [Bib - Plain]
200	Performance Characterization of Hypervisor- and Container-based Virtualization for HPC on SR-IOV Enabled InfiniBand Clusters J. Zhang, X. Lu, and DK Panda, IPDRM '16 (IPDPS Workshop), May 2016 [Bib - Plain]
201	High-Performance Hybrid Key-Value Store on Modern Clusters with RDMA Interconnects and SSDs: Non-blocking Extensions, Designs, and Benefits D. Shankar, X. Lu, N. Islam, M. W. Rahman, and DK Panda, The 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS '16), May 2016 [Bib - Plain]
202	Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled System C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and DK Panda, The 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS '16), May 2016 [Bib - Plain]
203	CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters C. Chu, K. Hamidouche, A. Venkatesh, Ammar Awan, and DK Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid'16), May 2016 [Bib - Plain]
204	SHMEMPMI - Shared Memory based PMI for Improved Performance and Scalability S. Chakraborty, H. Subramoni, J. Perkins, and DK Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid'16), May 2016 [Bib - Plain]
205	Characterizing Cloudera Impala Workloads with BigDataBench on InfiniBand Clusters K. Kulkarni, X. Lu, and DK Panda, The 7th Workshop on Big Data Benchmarks, Performance, Optimization, and Emerging Hardware (BPOE-7), Apr 2016 [Bib - Plain]
206	Offloaded GPU Collectives using CORE-Direct and CUDA Capabilities on IB Clusters A. Venkatesh, K. Hamidouche, H. Subramoni, and DK Panda, 22nd IEEE International Conference on High Performance Computing, Dec 2015 [Bib - Plain]
207	High Performance OpenSHMEM Strided Communication Support with InfiniBand UMR M. Li, K. Hamidouche, X. Lu, J. Zhang, J. Lin, and DK Panda, HiPC '15, Dec 2015 [Bib - Plain]
208	A Case for Application-Oblivious Energy-Efficient MPI Runtime A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, DK Panda, D. Kerbyson, and A. Hoisie, Supercomputing 2015, Nov 2015 [Best Student Paper Finalist] [Bib - Plain]
209	Performance Characterization and Acceleration of In-Memory File Systems for Hadoop and Spark Applications on HPC Clusters N. Islam, M. W. Rahman, X. Lu, D. Shankar, and DK Panda, 2015 IEEE International Conference on Big Data, Oct 2015 [Bib - Plain]
210	Benchmarking Key-Value Stores on High-Performance Storage and Interconnects for Web-Scale Workloads D. Shankar, X. Lu, M. W. Rahman, N. Islam, and DK Panda, 2015 IEEE International Conference on Big Data, Oct 2015 [Short Paper] [Bib - Plain]
211	GPU-Aware Design, Implementation, and Evaluation of Non-blocking Collective Benchmarks Ammar Awan, K. Hamidouche, A. Venkatesh, J. Perkins, H. Subramoni, and DK Panda, EuroMPI 2015, Sep 2015 [Bib - Plain]
212	High Performance MPI Datatype Support with User-mode Memory Registration: Challenges, Designs and Benefits M. Li, H. Subramoni, K. Hamidouche, X. Lu, and DK Panda, IEEE Cluster 2015, Sep 2015 [Bib - Plain]
213	Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters K. Hamidouche, A. Venkatesh, Ammar Awan, H. Subramoni, and DK Panda, IEEE Cluster 2015, Sep 2015 [Bib - Plain]
214	Accelerating I/O Performance of Big Data Analytics on HPC Clusters through RDMA-based Key-Value Store N. Islam, D. Shankar, X. Lu, M. W. Rahman, and DK Panda, The 44th International Conference on Parallel Processing (ICPP '15), Sep 2015 [Bib - Plain]
215	A Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS A. Bhat, N. Islam, X. Lu, M. W. Rahman, D. Shankar, and DK Panda, The Sixth workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware, Aug 2015 [Bib - Plain]
216	Impact of InfiniBand DC Transport Protocol on Energy Consumption of All-to-all Collective Algorithms H. Subramoni, A. Venkatesh, K. Hamidouche, K. Tomko, and DK Panda, 23rd International Symposium on High Performance Interconnects 2015, Aug 2015 [Bib - Plain]
217	High Performance and Scalable Design of MPI-3 RMA on Xeon Phi Clusters M. Li, K. Hamidouche, X. Lu, J. Lin, and DK Panda, Euro-Par 2015, Aug 2015 [Bib - Plain]
218	A Case for Non-Blocking Collectives in OpenSHMEM: Design, Implementation, and Performance Evaluation using MVAPICH2-X Ammar Awan, K. Hamidouche, C. Chu, and DK Panda, OpenSHMEM 2015 for PGAS Programming in the Exascale Era, Aug 2015 [Bib - Plain]
219	Accelerating k-NN Algorithm with Hybrid MPI and OpenSHMEM J. Lin, K. Hamidouche, J. Zhang, X. Lu, A. Vishnu, and DK Panda, OpenSHMEM 2015 for PGAS Programming in the Exascale Era, Aug 2015 [Bib - Plain]
220	Designing Non-Blocking Personalized Collectives with Near Perfect Overlap for RDMA-Enabled Clusters H. Subramoni, Ammar Awan, K. Hamidouche, D. Pekurovsky, A. Venkatesh, S. Chakraborty, K. Tomko, and DK Panda, ISC '15, Jul 2015 [Bib - Plain]
221	On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI S. Chakraborty, H. Subramoni, J. Perkins, Ammar Awan, and DK Panda, HIPS '15 (IPDPS Workshop), May 2015 [Bib - Plain]
222	High-Performance Coarray Fortran Support with MVAPICH2-X: Initial Experience and Evaluation J. Lin, K. Hamidouche, X. Lu, M. Li, and DK Panda, HIPS '15 (IPDPS Workshop), May 2015 [Bib - Plain]
223	High-Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA M. W. Rahman, X. Lu, N. Islam, R. Rajachandrasekar, and DK Panda, IPDPS '15, May 2015 [Bib - Plain]
224	Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture N. Islam, X. Lu, M. W. Rahman, D. Shankar, and DK Panda, CCGrid '15, May 2015 [Bib - Plain]
225	Non-blocking PMI Extensions for Fast MPI Startup S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins, and DK Panda, CCGrid '15, May 2015 [Bib - Plain]
226	MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds J. Zhang, X. Lu, M. Arnold, and DK Panda, CCGrid '15, May 2015 [Bib - Plain]
227	Power-Check: An Energy-Efficient Checkpointing Framework for HPC Clusters R. Rajachandrasekar, A. Venkatesh, K. Hamidouche, and DK Panda, CCGrid '15, May 2015 [Bib - Plain]
228	Can RDMA Benefit On-Line Data Processing Workloads with Memcached and MySQL D. Shankar, X. Lu, J. Jose, M. W. Rahman, N. Islam, and DK Panda, ISPASS '15, Mar 2015 [Poster Paper] [Bib - Plain]
229	Designing High Performance Communication Runtime for GPU Managed Memory: Early Experiences D. Banerjee, K. Hamidouche, and DK Panda, General Purpose GPU (GPGPU-9), Mar 2015 [Bib - Plain]
230	Designing Efficient Small Message Transfer Mechanism for Inter-node MPI Communication on InfiniBand GPU Clusters R. Shi, S. Potluri, K. Hamidouche, M. Li, J. Perkins, D. Rossetti, and DK Panda, IEEE International Conference on High Performance Computing (HiPC ’14), Dec 2014 [Bib - Plain]
231	A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on Infiniband Clusters A. Venkatesh, H. Subramoni, K. Hamidouche, and DK Panda, IEEE International Conference on High Performance Computing (HiPC ’14), Dec 2014 [Bib - Plain]
232	High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters J. Zhang, X. Lu, J. Jose, M. Li, R. Shi, and DK Panda, IEEE International Conference on High Performance Computing (HiPC ’14), Dec 2014 [Bib - Plain]
233	In-Memory I/O and Replication for HDFS with Memcached: Early Experiences N. Islam, X. Lu, M. W. Rahman, R. Rajachandrasekar, and DK Panda, IEEE BigData'14, Oct 2014 [Short Paper] [Bib - Plain]
234	Scalable MiniMD Design with Hybrid MPI and OpenSHMEM M. Li, J. Lin, X. Lu, K. Hamidouche, K. Tomko, and DK Panda, OUG '14 (Co-located with PGAS), Oct 2014 [Bib - Plain]
235	Designing Scalable Out-of-core Sorting with Hybrid MPI+PGAS Programming Models J. Jose, S. Potluri, H. Subramoni, X. Lu, K. Hamidouche, K. Schulz, H. Sundar, and DK Panda, International Conference on Partitioned Global Address Space Programming Models (PGAS '14), Oct 2014 [Bib - Plain]
236	PMI Extensions for Scalable MPI Startup S. Chakraborty, H. Subramoni, J. Perkins, A. Moody, M. Arnold, and DK Panda, EuroMPI/ASIA 2014, Sep 2014 [Bib - Plain]
237	Understanding the Memory-Utilization of MPI Libraries: Challenges and Designs in Implementing the MPI_T Interface R. Rajachandrasekar, J. Perkins, K. Hamidouche, M. Arnold, and DK Panda, EuroMPI/ASIA 2014, Sep 2014 [Bib - Plain]
238	HAND: A Hybrid Approach to Accelerate Non-contiguous Data Movement using MPI Datatypes on GPU Clusters R. Shi, X. Lu, S. Potluri, K. Hamidouche, J. Zhang, and DK Panda, International Conference on Parallel Processing (ICPP’14), Sep 2014 [Bib - Plain]
239	Designing Topology-Aware Communication Schedules for Alltoall Operations in Large InfiniBand Clusters H. Subramoni, K. Kandalla, J. Jose, K. Tomko, K. Schulz, D. Pekurovsky, and DK Panda, International Conference on Parallel Processing (ICPP’14), Sep 2014 [Bib - Plain]
240	A Micro-benchmark Suite for Evaluating Hadoop MapReduce on High-Performance Networks D. Shankar, X. Lu, M. W. Rahman, N. Islam, and DK Panda, The 5th Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware (BPOE-5), Sep 2014 [Bib - Plain]
241	Performance Modeling for RDMA-Enhanced Hadoop MapReduce M. W. Rahman, X. Lu, N. Islam, and DK Panda, 43rd International Conference on Parallel Processing (ICPP), Sep 2014 [Bib - Plain]
242	High Performance OpenSHMEM for MIC Clusters: Extensions, Runtime Designs, and Application Co-Design J. Jose, K. Hamidouche, X. Lu, S. Potluri, J. Zhang, K. Tomko, and DK Panda, IEEE CLUSTER’14, Sep 2014 [Bib - Plain]
243	Scalable Graph500 Design with MPI-3 RMA M. Li, X. Lu, S. Potluri, K. Hamidouche, J. Jose, K. Tomko, and DK Panda, IEEE CLUSTER’14, Sep 2014 [Bib - Plain]
244	MapReduce over Lustre: Can RDMA-based Approach Benefit? M. W. Rahman, X. Lu, N. Islam, R. Rajachandrasekar, and DK Panda, 20th International European Conference on Parallel Processing (Euro-Par), Aug 2014 [Bib - Plain]
245	Accelerating Spark with RDMA for Big Data Processing: Early Experiences X. Lu, M. W. Rahman, N. Islam, D. Shankar, and DK Panda, International Symposium on High Performance Interconnects (HotI'14), Aug 2014 [Bib - Plain]
246	Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? J. Zhang, X. Lu, J. Jose, R. Shi, and DK Panda, Euro-Par 2014 Parallel Processing, Aug 2014 [Bib - Plain]
247	HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects M. W. Rahman, X. Lu, N. Islam, and DK Panda, International Conference on Supercomputing (ICS '14), Jun 2014 [Bib - Plain]
248	SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS N. Islam, X. Lu, M. W. Rahman, and DK Panda, ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC '14), Short Paper, Jun 2014 [Bib - Plain]
249	MIC-Check: A Distributed Checkpointing Framework for the Intel Many Integrated Cores Architecture R. Rajachandrasekar, S. Potluri, A. Venkatesh, K. Hamidouche, M. W. Rahman, and DK Panda, International Symposium on High Performance and Distributed Computing (HPDC), Jun 2014 [Bib - Plain]
250	Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty, and DK Panda, IEEE International Supercomputing Conference (ISC ’14), Jun 2014 [Bib - Plain]
251	High Performance Alltoall and Allgather designs for InfiniBand MIC Clusters A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS’14), May 2014 [Bib - Plain]
252	Optimizing Collective Communication in UPC J. Jose, K. Hamidouche, J. Zhang, A. Venkatesh, and DK Panda, International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS '14), May 2014 [Slides] [Bib - Plain]
253	A Comprehensive Performance Evaluation of OpenSHMEM Libraries on InfiniBand Clusters J. Jose, J. Zhang, A. Venkatesh, S. Potluri, and DK Panda, OpenSHMEM Workshop, Mar 2014 [Bib - Plain]
254	Initial Study of Multi-Endpoint Runtime for MPI+OpenMP Hybrid Programming Model on Multi-Core Systems M. Luo, X. Lu, K. Hamidouche, K. Kandalla, and DK Panda, International Symposium on Principles and Practice of Parallel Programming (PPoPP '14), Feb 2014 [Bib - Plain]
255	The MVAPICH Project: Evolution and Sustainability of an Open Source Production Quality MPI Library for HPC DK Panda, K. Tomko, K. Schulz, and A. Majumdar, Int'l Workshop on Sustainable Software for Science: Practice and Experiences, Nov 2013 [Bib - Plain]
256	MVAPICH-PRISM: A Proxy-based Communication Framework using InfiniBand and SCIF for Intel MIC Clusters S. Potluri, D. Bureddy, K. Hamidouche, A. Venkatesh, K. Kandalla, H. Subramoni, and DK Panda, International Conference on Supercomputing (SC 2013), Nov 2013 [Bib - Plain]
257	Does RDMA-based Enhanced Hadoop MapReduce Need a New Performance Model? M. W. Rahman, X. Lu, N. Islam, and DK Panda, ACM Symposium on Cloud Computing (SoCC '13), Poster Paper, Oct 2013 [Bib - Plain]
258	High-Performance Design of Hadoop RPC with RDMA over InfiniBand X. Lu, N. Islam, M. W. Rahman, J. Jose, H. Subramoni, H. Wang, and DK Panda, International Conference on Parallel Processing (ICPP '13), Oct 2013 [Bib - Plain]
259	A Novel Functional Partitioning Approach to Design High-Performance MPI-3 Non-Blocking Alltoallv Collective on Multi-core Systems K. Kandalla, H. Subramoni, K. Tomko, D. Pekurovsky, and DK Panda, International Conference on Parallel Processing (ICPP '13), Oct 2013 [Bib - Plain]
260	Efficient Inter-node MPI Communication using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs S. Potluri, K. Hamidouche, A. Venkatesh, D. Bureddy, and DK Panda, International Conference on Parallel Processing (ICPP '13), Oct 2013 [Bib - Plain]
261	UPC on MIC: Early Experiences with Native and Symmetric Modes M. Luo, M. Li, A. Venkatesh, X. Lu, and DK Panda, International Conference on Partitioned Global Address Space Programming Models (PGAS '13), Oct 2013 [Bib - Plain]
262	Optimizing Collective Communication in OpenSHMEM J. Jose, K. Kandalla, S. Potluri, J. Zhang, and DK Panda, International Conference on Partitioned Global Address Space Programming Models (PGAS '13), Oct 2013 [Bib - Plain]
263	Design of Network Topology Aware Scheduling Services for Large InfiniBand Clusters H. Subramoni, D. Bureddy, K. Kandalla, K. Schulz, B. Barth, J. Perkins, M. Arnold, and DK Panda, IEEE Cluster (Cluster '13), Sep 2013 [Bib - Plain]
264	A Scalable and Portable Approach to Accelerate Hybrid HPL on Heterogeneous CPU-GPU Clusters R. Shi, S. Potluri, K. Hamidouche, X. Lu, K. Tomko, and DK Panda, IEEE Cluster (Cluster '13), Sep 2013 [Bib - Plain]
265	Efficient and Truly Passive MPI-3 RMA Using InfiniBand Atomics M. Li, S. Potluri, K. Hamidouche, J. Jose, and DK Panda, EuroMPI 2013, Sep 2013 [Slides] [Bib - Plain]
266	Can Parallel Replication Benefit HDFS for High-Performance Interconnects? N. Islam, X. Lu, M. W. Rahman, and DK Panda, International Symposium on High-Performance Interconnects (HotI '13), Aug 2013 [Bib - Plain]
267	Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters K. Kandalla, A. Venkatesh, K. Hamidouche, S. Potluri, and DK Panda, International Symposium on High-Performance Interconnects (HotI '13), Aug 2013 [Bib - Plain]
268	MVAPICH2-MIC: A High-Performance MPI Library for Xeon Phi Clusters with InfiniBand S. Potluri, K. Hamidouche, D. Bureddy, and DK Panda, Extreme Scaling Workshop, Aug 2013 [Bib - Plain]
269	Optimized MPI Gather collective for Many Integrated Core (MIC) InfiniBand Clusters A. Venkatesh, K. Kandalla, and DK Panda, Extreme Scaling Workshop, Aug 2013 [Bib - Plain]
270	A Micro-Benchmark Suite for Evaluating Hadoop RPC on High-Performance Networks X. Lu, M. W. Rahman, N. Islam, and DK Panda, International Workshop on Big Data Benchmarking (WBDB '13), Jul 2013 [Bib - Plain]
271	A 1PB/s File System to Checkpoint Three Million MPI Tasks R. Rajachandrasekar, A. Moody, K. Mohror, and DK Panda, International Conference on High Performance Distributed Computing (HPDC '13), Jun 2013 [Slides] [Bib - Plain]
272	Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models J. Jose, S. Potluri, K. Tomko, and DK Panda, International Supercomputing Conference (ISC '13), Jun 2013 [Slides] [Bib - Plain]
273	MIC-RO: Enabling Efficient Remote Offload on Heterogeneous Many Integrated Core (MIC) Clusters with InfiniBand K. Hamidouche, S. Potluri, H. Subramoni, K. Kandalla, and DK Panda, International Conference on Supercomputing (ICS '13), Jun 2013 [Bib - Plain]
274	High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand M. W. Rahman, N. Islam, X. Lu, J. Jose, H. Subramoni, H. Wang, and DK Panda, International Workshop on High Performance Data Intensive Computing (HPDIC), May 2013 [Bib - Plain]
275	A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters N. Islam, X. Lu, M. W. Rahman, J. Jose, and DK Panda, Special Issue of LNCS on papers from WBDB '12 Workshop, May 2013 [Bib - Plain]
276	Extending OpenSHMEM for GPU Computing S. Potluri, D. Bureddy, H. Wang, H. Subramoni, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS '13), May 2013 [Slides] [Bib - Plain]
277	Evaluation of Energy Characteristics of MPI Communication Primitives with RAPL A. Venkatesh, K. Kandalla, and DK Panda, International Workshop on High-Performance, Power-Aware Computing (HPPAC), May 2013 [Bib - Plain]
278	High Performance RDMA-Based Design of HDFS over InfiniBand N. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy, and DK Panda, International Conference on Supercomputing (SC '12), Nov 2012 [Slides] [Bib - Plain]
279	Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and DK Panda, International Conference on Supercomputing (SC '12), Nov 2012 [Bib - Plain]
280	Multi-Threaded UPC Runtime for GPU to GPU communication over InfiniBand M. Luo, H. Wang, and DK Panda, International Conference on Partitioned Global Address Space Programming Models (PGAS '12), Oct 2012 [Slides] [Bib - Plain]
281	SSD-Assisted Hybrid Memory to Accelerate Memcached over High Performance Networks X. Ouyang, N. Islam, R. Rajachandrasekar, J. Jose, M. Luo, H. Wang, and DK Panda, International Conference on Parallel Processing (ICPP '12), Sep 2012 [Bib - Plain]
282	Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation J. Jose, K. Kandalla, M. Luo, and DK Panda, International Conference on Parallel Processing (ICPP '12), Sep 2012 [Bib - Plain]
283	OMB-GPU: A Micro-benchmark suite for Evaluating MPI Libraries on GPU Clusters D. Bureddy, H. Wang, A. Venkatesh, S. Potluri, and DK Panda, EuroMPI 2012, Sep 2012 [Bib - Plain]
284	Minimizing Network Contention in InfiniBand Clusters with a QoS-Aware Data-Staging Framework R. Rajachandrasekar, J. Jaswani, H. Subramoni, and DK Panda, IEEE Cluster (Cluster '12), Sep 2012 [Bib - Plain]
285	Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms? K. Kandalla, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and DK Panda, Int'l Workshop on Parallel Algorithm and Parallel Software (IWPAPS12), held in conjunction with IEEE Cluster (Cluster '12), Sep 2012 [Bib - Plain]
286	A Scalable InfiniBand Network-Topology-Aware Performance Analysis Tool for MPI H. Subramoni, J. Vienne, and DK Panda, International Workshop on Productivity and Performance (Proper '12), Aug 2012 [Bib - Plain]
287	Performance Analysis and Evaluation of InfiniBand FDR and 40GigE RoCE on HPC and Cloud Computing Systems J. Vienne, J. Chen, M. W. Rahman, N. Islam, H. Subramoni, and DK Panda, International Symposium on High-Performance Interconnects (HotI 2012), Aug 2012 [Bib - Plain]
288	Congestion Avoidance on Manycore High Performance Computing Systems M. Luo, DK Panda, C. Iancu, and K. Z. Ibrahim, International Conference on Supercomputing (ICS '12), Jun 2012 [Bib - Plain]
289	Redesigning MPI Shared Memory Communication for Large Multi-Core Architecture M. Luo, H. Wang, J. Vienne, and DK Panda, International Supercomputing Conference 2012, Jun 2012 [Bib - Plain]
290	High-Performance Design of HBase with RDMA over InfiniBand J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, C. Murthy, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS '12), May 2012 [Bib - Plain]
291	Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers K. Kandalla, U. Yang, J. Keasler, T. Kolev, A. Moody, H. Subramoni, K. Tomko, J. Vienne, and DK Panda, International Parallel and Distributed Processing Symposium 2012, May 2012 [Bib - Plain]
292	Designing Network Failover and Recovery in MPI for Multi-Rail InfiniBand Clusters S. P. Raikar, H. Subramoni, K. Kandalla, J. Vienne, and DK Panda, International Workshop on System Management Techniques, May 2012 [Bib - Plain]
293	Monitoring and Predicting Hardware Failures in HPC Clusters with FTB-IPMI R. Rajachandrasekar, X. Besseron, and DK Panda, International Workshop on System Management Techniques, May 2012 [Bib - Plain]
294	Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication S. Potluri, H. Wang, D. Bureddy, A. Singh, C. Rosales, and DK Panda, International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), May 2012 [Slides] [Bib - Plain]
295	Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks? M. W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, C. Murthy, and DK Panda, International Symposium on Performance Analysis of Systems and Software (ISPASS '12), Poster Paper, Apr 2012 [Bib - Plain]
296	Intra-MIC MPI Communication using MVAPICH2: Early Experience S. Potluri, K. Tomko, D. Bureddy, and DK Panda, TACC-Intel Highly-Parallel Computing Symposium, Apr 2012 [Slides] [Bib - Plain]
297	Multi-threaded UPC Runtime with Network Endpoints: Design Alternatives and Evaluation on Multi-core Architectures M. Luo, J. Jose, S. Sur, and DK Panda, International Conference on High Performance Computing (HiPC '11), Dec 2011 [Slides] [Bib - Plain]
298	UPC Queues for Scalable Graph Traversals: Design and Evaluation on InfiniBand Clusters J. Jose, S. Potluri, M. Luo, S. Sur, and DK Panda, Fifth Conference on Partitioned Global Address Space Programming Model (PGAS '11), Oct 2011 [Slides] [Bib - Plain]
299	Memcached Design on High Performance RDMA Capable Interconnects J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, M. W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur, and DK Panda, International Conference on Parallel Processing (ICPP '11), Sep 2011 [Slides] [Bib - Plain]
300	Can a Decentralized Metadata Service Layer benefit Parallel Filesystems? V. Meshram, X. Besseron, X. Ouyang, R. Rajachandrasekar, and DK Panda, Workshop on Interfaces and Architectures for Scientific Data Storage (IASDS '11), held in conjunction with Cluster '11, Sep 2011 [Bib - Plain]
301	MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits A. Singh, S. Potluri, H. Wang, K. Kandalla, S. Sur, and DK Panda, International Workshop on Parallel Programming on Accelerator Clusters (PPAC '11), Sep 2011 [Slides] [Bib - Plain]
302	Design and Evaluation of Network Topology-/Speed- Aware Broadcast Algorithms for InfiniBand Clusters H. Subramoni, K. Kandalla, J. Vienne, S. Sur, B. Barth, K. Tomko, R. McLay, K. Schulz, and DK Panda, IEEE Cluster '11, Sep 2011 [Bib - Plain]
303	Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design Implementation and Evaluation with MVAPICH2 H. Wang, S. Potluri, M. Luo, A. Singh, X. Ouyang, S. Sur, and DK Panda, IEEE Cluster '11, Sep 2011 [Slides] [Bib - Plain]
304	Optimizing MPI One Sided Communication on Multi-core InfiniBand Clusters using Shared Memory Backed Windows S. Potluri, H. Wang, V. Dhanraj, S. Sur, and DK Panda, EuroMPI '11, Sep 2011 [Bib - Plain]
305	Design and Implementation of Key Proposed MPI-3 One-Sided Communication Semantics on InfiniBand S. Potluri, S. Sur, D. Bureddy, and DK Panda, EuroMPI '11, Sep 2011 [Slides] [Poster/Short Paper] [Bib - Plain]
306	CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart X. Ouyang, R. Rajachandrasekar, X. Besseron, H. Wang, J. Huang, and DK Panda, International Conference on Parallel Processing (ICPP '11), Sep 2011 [Slides] [Bib - Plain]
307	Can Checkpoint/Restart Mechanisms Benefit from Hierarchical Data Staging? R. Rajachandrasekar, X. Ouyang, X. Besseron, V. Meshram, and DK Panda, Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids 2011, held in conjunction with EuroPar, Aug 2011 [Bib - Plain]
308	INAM - A Scalable InfiniBand Network Analysis and Monitoring Tool N. Dandapanthula, H. Subramoni, J. Vienne, K. Kandalla, S. Sur, DK Panda, and R. Brightwell, 4th International Workshop on Productivity and Performance (PROPER 2011), Aug 2011 [Slides] [Bib - Plain]
309	Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL K. Kandalla, H. Subramoni, J. Vienne, K. Tomko, S. Sur, and DK Panda, Hot Interconnects '11, Aug 2011 [Bib - Plain]
310	High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT K. Kandalla, H. Subramoni, K. Tomko, D. Pekurovsky, S. Sur, and DK Panda, International Supercomputing Conference '11 (ISC'11), Jun 2011 [Bib - Plain]
311	MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters H. Wang, S. Potluri, M. Luo, A. Singh, S. Sur, and DK Panda, International Supercomputing Conference '11 (ISC'11), Jun 2011 [Slides] [Bib - Plain]
312	Scalable Memcached design for InfiniBand Clusters using Hybrid Transports J. Jose, H. Subramoni, K. Kandalla, M. W. Rahman, H. Wang, S. Narravula, and DK Panda, International Symposium on Cluster, May 2011 [Bib - Plain]
313	Efficient Intra-node Communication on Intel-MIC Clusters S. Potluri, A. Venkatesh, D. Bureddy, K. Kandalla, and DK Panda, International Symposium on Cluster, May 2011 [Slides] [Bib - Plain]
314	SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience J. Jose, M. Li, X. Lu, K. Kandalla, M. Arnold, and DK Panda, International Symposium on Cluster, May 2011 [Slides] [Bib - Plain]
315	High Performance Pipelined Process Migration with RDMA X. Ouyang, R. Rajachandrasekar, X. Besseron, and DK Panda, International Symposium on Cluster, May 2011 [Slides] [Bib - Plain]
316	Beyond Block I/O: Rethinking Traditional Storage Primitives X. Ouyang, D. Nellans, R. Wipfel, D. Flynn, and DK Panda, 17th IEEE International Symposium on High Performance Computer Architecture (HPCA-17), Feb 2011 [Slides] [Bib - Plain]
317	Can High-Performance Interconnects Benefit Hadoop Distributed File System? S. Sur, H. Wang, J. Huang, X. Ouyang, and DK Panda, Workshop on Micro Architectural Support for Virtualization, Dec 2010 [Slides] [Bib - Plain]
318	Scalable Earthquake Simulation on Petascale Supercomputers Y. Cui, K. B. Olsen, T. H. Jordan, K. Lee, J. Zhou, P. Small, D. Roten, G. Ely, DK Panda, A. Chourasia, J. Levesque, S. M. Day, and P. Maechling, SuperComputing 2010, Nov 2010 [Bib - Plain]
319	Unifying UPC and MPI Runtimes: Experience with MVAPICH J. Jose, M. Luo, S. Sur, and DK Panda, International Workshop on Partitioned Global Address Space (PGAS '10), Oct 2010 [Slides] [Bib - Plain]
320	RDMA-Based Job Migration Framework for MPI over InfiniBand X. Ouyang, S. Marcarelli, R. Rajachandrasekar, and DK Panda, IEEE International Conference on Cluster Computing (Cluster '10), Sep 2010 [Bib - Plain]
321	Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters H. Subramoni, P. Lai, S. Sur, and DK Panda, International Conference on Parallel Processing (ICPP '10), Sep 2010 [Slides] [Bib - Plain]
322	Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters K. Kandalla, E. Mancini, S. Sur, and DK Panda, International Conference on Parallel Processing (ICPP '10), Sep 2010 [Slides] [Bib - Plain]
323	High Performance Design and Implementation of Nemesis Communication Layer for Two-sided and One-Sided MPI Semantics in MVAPICH2 M. Luo, S. Potluri, P. Lai, E. Mancini, H. Subramoni, K. Kandalla, S. Sur, and DK Panda, International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2 '10), Sep 2010 [Bib - Plain]
324	Design and Evaluation of Generalized Collective Communication Primitives with Overlap using ConnectX-2 Offload Engine H. Subramoni, K. Kandalla, S. Sur, and DK Panda, International Symposium on High Performance Interconnects 2010, Aug 2010 [Bib - Plain]
325	Quantifying Performance Benefits of Overlap using MPI-2 in a Seismic Modeling Application S. Potluri, P. Lai, K. Tomko, S. Sur, Y. Cui, M. Tatineni, K. Schulz, W. Barth, A. Majumdar, and DK Panda, 24th International Conference on Supercomputing (ICS), Jun 2010 [Bib - Plain]
326	Designing Truly One-Sided MPI-2 RMA Intra-node Communication on Multi-core Systems P. Lai, S. Sur, and DK Panda, 24th International Conference on Supercomputing (ICS), Jun 2010 [Slides] [Bib - Plain]
327	High Performance Data Transfer in Grid Environment Using GridFTP over InfiniBand H. Subramoni, P. Lai, R. Kettimuthu, and DK Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid'10), May 2010 [Slides] [Bib - Plain]
328	Enhancing Checkpoint Performance with Staging IO and SSD X. Ouyang, S. Marcarelli, and DK Panda, IEEE International Workshop on Storage Network Architecture and Parallel I/Os (SNAPI), May 2010 [Slides] [Bib - Plain]
329	Designing Topology-Aware Collective Communication Algorithms for Large Scale InfiniBand Clusters: Case Studies with Scatter and Gather K. Kandalla, H. Subramoni, A. Vishnu, and DK Panda, International Workshop on Communication Architecture for Clusters (CAC 10), Apr 2010 [Bib - Plain]
330	Designing High-Performance and Resilient Message Passing on InfiniBand M. Koop, P. Shamis, I. Rabinovitz, and DK Panda, International Workshop on Communication Architecture for Clusters (CAC 10), Apr 2010 [Bib - Plain]
331	Designing Efficient FTP Mechanisms for High Performance Data-Transfer over InfiniBand P. Lai, H. Subramoni, S. Narravula, A. Mamidala, and DK Panda, International Conference on Parallel Processing (ICPP '09), Sep 2009 [Slides] [Bib - Plain]
332	Accelerating Checkpoint Operation by Node-Level Write Aggregation on Multicore Systems X. Ouyang, K. Gopalakrishnan, and DK Panda, International Conference on Parallel Processing (ICPP '09), Sep 2009 [Slides] [Bib - Plain]
333	CIFTS: A Coordinated Infrastructure for Fault-Tolerant Systems R. Gupta, P. Beckman, H. Park, E. Lusk, P. Hargrove, A. Geist, DK Panda, A. Lumsdaine, and J. Dongarra, International Conference on Parallel Processing (ICPP '09), Sep 2009 [Bib - Plain]
334	Designing and Evaluating MPI-2 Dynamic Process Management Support for InfiniBand T. Gangadharappa, M. Koop, and DK Panda, International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2 '09), Sep 2009 [Bib - Plain]
335	Impact of Node Level Caching in MPI Job Launch Mechanisms J. Sridhar and DK Panda, EuroPVM/MPI '09, Sep 2009 [Slides] [Bib - Plain]
336	An Efficient Hardware-Software Approach to Network Fault Tolerance with InfiniBand A. Vishnu, M. Krishnan, and DK Panda, International Conference on Cluster Computing (Cluster '09), Sep 2009 [Slides] [Bib - Plain]
337	Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters M. Koop, M. Luo, and DK Panda, International Conference on Cluster Computing (Cluster '09), Sep 2009 [Slides] [Bib - Plain]
338	Design Alternatives for Implementing Fence Synchronization in MPI-2 One-sided Communication on InfiniBand Clusters G. Santhanaraman, T. Gangadharappa, S. Narravula, A. Mamidala, and DK Panda, International Conference on Cluster Computing (Cluster '09), Sep 2009 [Slides] [Bib - Plain]
339	RDMA over Ethernet - A Preliminary Study H. Subramoni, P. Lai, M. Luo, and DK Panda, International Workshop on High Performance Distributed Computing (HPI-DC '09), Sep 2009 [Slides] [Bib - Plain]
340	ProOnE: A General Purpose Protocol Onload Engine for Multi- and Many-Core Architectures P. Lai, P. Balaji, R. Thakur, and DK Panda, International Supercomputing Conference (ISC), Jun 2009 [Bib - Plain]
341	Designing Multi-Leader-Based Allgather Algorithms for Multi-Core Clusters K. Kandalla, H. Subramoni, G. Santhanaraman, and DK Panda, International Workshop on Communication Architecture for Clusters (CAC'09), May 2009 [Slides] [Bib - Plain]
342	Fast Checkpointing by Write Aggregation with Dynamic Buffer and Interleaving on Multicore Architecture X. Ouyang, K. Gopalakrishnan, and DK Panda, Int'l Conference on High Performance Computing 2009, Feb 2009 [Slides] [Bib - Plain]
343	ScELA: Scalable and Extensible Launching Architecture for Clusters J. Sridhar, M. Koop, J. Perkins, and DK Panda, International Symposium on High Performance Computing (HiPC), Dec 2008 [Slides] [Bib - Plain]
344	Designing High Performance pNFS With RDMA on InfiniBand R. Noronha, X. Ouyang, and DK Panda, International Symposium on High Performance Computing (HiPC), Dec 2008 [Bib - Plain]
345	Sockets Direct Protocol for Hybrid Network Stacks: A Case Study with iWARP over 10G Ethernet P. Balaji, S. Bhagvat, R. Thakur, and DK Panda, International Symposium on High Performance Computing (HiPC), Dec 2008 [Slides] [Bib - Plain]
346	Design and Evaluation of Benchmarks for Financial Applications using Advanced Message Queuing Protocol (AMQP) over InfiniBand H. Subramoni, G. Marsh, S. Narravula, P. Lai, and DK Panda, Workshop on High Performance Computational Finance (In conjunction with SC '08), Nov 2008 [OSU Technical Report Version (OSU-CISRC-10/08-TR51)] [Bib - Plain]
347	Scalable MPI Design over InfiniBand using eXtended Reliable Connection M. Koop, J. Sridhar, and DK Panda, IEEE Cluster 2008, Sep 2008 [Slides] [Bib - Plain]
348	Efficient One-Copy MPI Shared Memory Communication in Virtual Machines W. Huang, M. Koop, and DK Panda, IEEE Cluster 2008, Sep 2008 [Slides] [Bib - Plain]
349	IMCa: A High Performance Caching Frontend for GlusterFS on InfiniBand R. Noronha and DK Panda, International Conference on Parallel Processing 2008, Sep 2008 [Slides] [Bib - Plain]
350	Performance of HPC middleware over InfiniBand WAN S. Narravula, H. Subramoni, P. Lai, R. Noronha, and DK Panda, International Conference on Parallel Processing 2008, Sep 2008 [Bib - Plain]
351	Designing An Efficient Kernel-level and User-level Hybrid Approach for MPI Intra-node Communication on Multi-core Systems L. Chai, P. Lai, H. Jin, and DK Panda, International Conference on Parallel Processing 2008, Sep 2008 [Slides] [Bib - Plain]
352	Lock-free Asynchronous Rendezvous Design for MPI Point-to-point Communication R. Kumar, A. Mamidala, M. Koop, G. Santhanaraman, and DK Panda, EuroPVM/MPI '08, Sep 2008 [OSU-CISRC-6/08-TR36] [Bib - Plain]
353	Can Software Reliability Outperform Hardware Reliability on High Performance Interconnects? A Case Study with MPI over InfiniBand M. Koop, R. Kumar, and DK Panda, 22nd ACM International Conference on Supercomputing (ICS '08), Jun 2008 [Bib - Plain]
354	Advanced RDMA-based Admission Control for Modern Data-Centers P. Lai, S. Narravula, K. Vaidyanathan, and DK Panda, CCGrid '08, May 2008 [Slides] [Bib - Plain]
355	Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications K. Vaidyanathan, and S. Narravula, CCGrid '08, May 2008 [Slides] [Bib - Plain]
356	MPI Collectives on modern Multicore clusters: Performance Optimizations and Communication Characteristics A. Mamidala, R. Kumar, D. De, and DK Panda, CCGrid '08, May 2008 [Bib - Plain]
357	Scaling Alltoall Collective on Multi-core Systems R. Kumar, A. Mamidala, and DK Panda, International Workshop on Communication Architecture for Clusters, Apr 2008 [Slides] [Bib - Plain]
358	pNFS/PVFS2 over InfiniBand: Early Experiences L. Chai, X. Ouyang, R. Noronha, and DK Panda, Petascale Data Storage Workshop, Nov 2007 [Slides] [Bib - Plain]
359	Virtual Machine Aware Communication Libraries for High Performance Computing W. Huang, M. Koop, Q. Gao, and DK Panda, SuperComputing (SC'07), Nov 2007 [Slides] [Best Student Paper Finalist] [Bib - Plain]
360	Enhancing the Performance of NFSv4 with RDMA R. Noronha, L. Chai, S. Shepler, and DK Panda, International Workshop on Storage Network Architecture and Parallel I/Os (SNAPI'07), Sep 2007 [Bib - Plain]
361	MPI-2 One Sided Usage and Implementation for Read Modify Write operations: A case study with HPCC G. Santhanaraman, S. Narravula, A. Mamidala, and DK Panda, EuroPVM/MPI 2007, Sep 2007 [Bib - Plain]
362	Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram M. Koop, S. Sur, and DK Panda, IEEE International Conference on Cluster Computing 2007, Sep 2007 [Bib - Plain]
363	High Performance Virtual Machine Migration with RDMA over Modern Interconnects W. Huang, Q. Gao, J. Liu, and DK Panda, IEEE International Conference on Cluster Computing 2007, Sep 2007 [Best Paper] [Bib - Plain]
364	Efficient Asynchronous Memory Copy Operations on Multi-Core Systems and I/OAT K. Vaidyanathan, L. Chai, W. Huang, and DK Panda, IEEE International Conference on Cluster Computing 2007, Sep 2007 [Bib - Plain]
365	Group-based Coordinated Checkpointing for MPI: A Case Study on InfiniBand Q. Gao, W. Huang, M. Koop, and DK Panda, International Conference on Parallel Processing (ICPP'07), Sep 2007 [Slides] [Bib - Plain]
366	High Performance MPI over iWARP: Early Experiences S. Narravula, A. Mamidala, A. Vishnu, G. Santhanaraman, and DK Panda, High Performance MPI over iWARP: Early Experiences, Sep 2007 [Bib - Plain]
367	Designing NFS With RDMA For Security, Performance and Scalability R. Noronha, L. Chai, T. Talpey, and DK Panda, International Conference on Parallel Processing 2007, Sep 2007 [Bib - Plain]
368	Designing Next Generation Clusters: Evaluation of InfiniBand DDR/QDR on Intel Computing Platforms H. Subramoni, M. Koop, and DK Panda, International Symposium on Hot Interconnects (HotI), Aug 2007 [Slides] [Bib - Plain]
369	Performance Analysis and Evaluation of PCIe 2.0 and Quad-Data Rate InfiniBand M. Koop, W. Huang, K. Gopalakrishnan, and DK Panda, International Symposium on Hot Interconnects (HotI), Aug 2007 [Bib - Plain]
370	Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms S. Sur, M. Koop, L. Chai, and DK Panda, International Symposium on Hot Interconnects (HotI), Aug 2007 [Slides] [Bib - Plain]
371	High Performance MPI Design using Unreliable Datagram for Ultra-Scale InfiniBand Clusters M. Koop, S. Sur, Q. Gao, and DK Panda, 21st International ACM Conference on Supercomputing (ICS '07), Jun 2007 [Bib - Plain]
372	Nomad: Migrating OS-bypass Networks in Virtual Machines W. Huang, J. Liu, M. Koop, B. Abali, and DK Panda, Third International SIGPLAN/SIGOPS Conference on Virtual Execution Environments (VEE), Jun 2007 [Bib - Plain]
373	High Performance Distributed Lock Management Services using Network-based Remote Atomic Operations S. Narravula, A. Mamidala, A. Vishnu, K. Vaidyanathan, and DK Panda, International Symposium on Cluster Computing and the Grid (CCGrid 2007), May 2007 [Slides] [Bib - Plain]
374	Design and Implementation of High Performance MVAPICH2: MPI2 over InfiniBand W. Huang, G. Santhanaraman, H. Jin, Q. Gao, and DK Panda, International Symposium on Cluster Computing and the Grid (CCGrid 2007), May 2007 [Bib - Plain]
375	Benefits of I/O Acceleration Technology (I/OAT) in Clusters K. Vaidyanathan, and DK Panda, International Symposium on Performance Analysis of Systems and Software (ISPASS), Apr 2007 [Bib - Plain]
376	Designing Efficient Systems Services and Primitives for Next-Generation Data-Centers K. Vaidyanathan, S. Narravula, P. Balaji, and DK Panda, Workshop on NSF Next Generation Software (NGS) Program; held in conjunction with IPDPS, Apr 2007 [Bib - Plain]
377	Improving Scalability of OpenMP Applications on MultiCore Systems Using Large Page Support R. Noronha, and DK Panda, International Workshop on Multithreaded Architectures and Applications (MTAAP), Mar 2007 [Bib - Plain]
378	High Performance MPI on IBM 12x InfiniBand Architecture A. Vishnu, B. Benton, and DK Panda, International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS), Mar 2007 [Bib - Plain]
379	Automatic Path Migration over InfiniBand: Early Experience A. Vishnu, A. Mamidala, S. Narravula, and DK Panda, Third International Workshop on System Management Techniques, Mar 2007 [Bib - Plain]
380	Designing Efficient Asynchronous Memory Operations Using Hardware Copy Engine: A Case Study with I/OAT K. Vaidyanathan, W. Huang, L. Chai, and DK Panda, International Workshop on Communication Architecture for Clusters (CAC), Mar 2007 [Bib - Plain]
381	Using Connection-Oriented and Connection-Less Transport on Performance and Scalability of Collective and One-sided operations: Trade-offs and Impact A. Mamidala, S. Narravula, A. Vishnu, G. Santhanaraman, and DK Panda, International Symposium on Principles and Practice of Parallel Programming (PPoPP 2007), Mar 2007 [Bib - Plain]
382	DDSS: A Low-Overhead Distributed Data Sharing Substrate for Cluster-Based Data-Centers over Modern Interconnects K. Vaidyanathan, S. Narravula, and DK Panda, International Conference on High Performance Computing (HiPC), Dec 2006 [Slides] [Bib - Plain]
383	Finding Bugs in Large-Scale Parallel Programs by Detecting Anomaly in Data Movements Q. Gao, F. Qin, and DK Panda, SuperComputing 2006, Nov 2006 [Bib - Plain]
384	Analyzing the Impact of Supporting Out-of-Order Communication on In-order Performance with iWARP P. Balaji, W. Feng, S. Bhagvat, DK Panda, R. Thakur, and W. Gropp, SuperComputing 2006, Nov 2006 [Bib - Plain]
385	High-Performance and Scalable MPI over InfiniBand with Reduced Memory Usage: An In-Depth Performance Analysis S. Sur, M. Koop, and DK Panda, SuperComputing 2006, Nov 2006 [Bib - Plain]
386	A Software Based Approach for Providing Network Fault Tolerance in Clusters Using the uDAPL Interface: MPI Level Design and Performance Evaluation A. Vishnu, P. Gupta, A. Mamidala, and DK Panda, SuperComputing 2006, Nov 2006 [Bib - Plain]
387	NemC: A Network Emulator for Cluster-of-Clusters H. Jin, S. Narravula, K. Vaidyanathan, and DK Panda, International Conference on Computer Communications and Networks, Oct 2006 [Bib - Plain]
388	Designing Efficient MPI Intra-node Communication Support for Modern Computer Architectures L. Chai, A. Hartono, and DK Panda, International Conference on Cluster Computing, Sep 2006 [Bib - Plain]
389	Efficient Shared Memory and RDMA based design for MPI_Allgather over InfiniBand A. Mamidala, A. Vishnu, and DK Panda, EuroPVM/MPI, Sep 2006 [Bib - Plain]
390	Exploiting RDMA operations for Providing Efficient Fine-Grained Resource Monitoring in Cluster-based Servers K. Vaidyanathan, H. Jin, and DK Panda, Workshop on Remote Direct Memory Access (RDMA): Applications, Implementations, and Technologies, Sep 2006 [Bib - Plain]
391	Memory Scalability Evaluation of the Next-Generation Intel Bensley Platform with InfiniBand M. Koop, W. Huang, A. Vishnu, and DK Panda, International Symposium on Hot Interconnect 2006 (HotI'06), Aug 2006 [Slides] [Bib - Plain]
392	Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand Q. Gao, W. Yu, W. Huang, and DK Panda, International Conference on Parallel Processing (ICPP), Aug 2006 [Slides] [Bib - Plain]
393	High Performance Block I/O for Global File System (GFS) with InfiniBand RDMA S. Liang, W. Yu, and DK Panda, International Conference on Parallel Processing (ICPP), Aug 2006 [Bib - Plain]
394	A Case for High Performance Computing with Virtual Machines W. Huang, J. Liu, B. Abali, and DK Panda, International Conference on Supercomputing (ICS), Jun 2006 [Slides] [Bib - Plain]
395	High Performance VMM-Bypass I/O in Virtual Machines J. Liu, W. Huang, B. Abali, and DK Panda, USENIX Annual Technical Conference, Jun 2006 [Bib - Plain]
396	An MPI-Stream Hybrid Programming Model for Computational Clusters E. Mancini, G. Marsh, and DK Panda, International Symposium on Cluster Computing and the Grid (CCGrid 2006), May 2006 [Slides] [Bib - Plain]
397	Natively Supporting True One-sided Communication in MPI on Multi-core Systems with InfiniBand G. Santhanaraman, P. Balaji, K. Gopalakrishnan, R. Thakur, W. Gropp, and DK Panda, International Symposium on Cluster Computing and the Grid (CCGrid 2006), May 2006 [Bib - Plain]
398	Reducing Connection Memory Requirements of MPI for InfiniBand Clusters: A Message Coalescing Approach M. Koop, T. Jones, and DK Panda, International Symposium on Cluster Computing and the Grid (CCGrid 2006), May 2006 [Bib - Plain]
399	Understanding the Impact of Multi-Core Architecture in Cluster Computing: A Case Study with Intel Dual-Core System L. Chai, Q. Gao, and DK Panda, International Symposium on Cluster Computing and the Grid (CCGrid 2006), May 2006 [Bib - Plain]
400	Hot-Spot Avoidance With Multi-Pathing Over InfiniBand: An MPI Perspective A. Vishnu, M. Koop, A. Moody, A. Mamidala, S. Narravula, and DK Panda, International Symposium on Cluster Computing and the Grid (CCGrid 2006), May 2006 [Bib - Plain]
401	Designing Efficient Cooperative Caching Schemes for Multi-Tier Data-Centers over RDMA-enabled Networks S. Narravula, H. Jin, K. Vaidyanathan, and DK Panda, International Symposium on Cluster Computing and the Grid (CCGrid 2006), May 2006 [Bib - Plain]
402	MPI over uDAPL: Can High Performance and Portability Exist Across Architectures? L. Chai, R. Noronha, and DK Panda, International Symposium on Cluster Computing and the Grid 2006, May 2006 [Bib - Plain]
403	Designing High Performance and Scalable MPI Intra-node Communication Support for Clusters L. Chai, and DK Panda, International Symposium on Cluster Computing and the Grid 2006, May 2006 [Slides] [Bib - Plain]
404	Designing Next-Generation Data-Centers with Advanced Communication Protocols and Systems Services P. Balaji, K. Vaidyanathan, S. Narravula, H. Jin, and DK Panda, Workshop on NSF Next Generation Software (NGS) Program; held in conjunction with IPDPS, Apr 2006 [Slides] [Bib - Plain]
405	Shared Receive Queue based Scalable MPI Design for InfiniBand Clusters S. Sur, L. Chai, H. Jin, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS '06), Apr 2006 [Bib - Plain]
406	Adaptive Connection Management for Scalable MPI over InfiniBand W. Yu, Q. Gao, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS '06), Apr 2006 [Slides] [Bib - Plain]
407	Efficient SMP-Aware MPI-Level Broadcast over InfiniBand's Hardware Multicast A. Mamidala, L. Chai, H. Jin, and DK Panda, Communication Architecture for Clusters (CAC) Workshop, Apr 2006 [Bib - Plain]
408	Asynchronous Zero-Copy Communication for Synchronous Sockets Direct Protocol (SDP) over InfiniBand P. Balaji, S. Bhagvat, H. Jin, and DK Panda, Communication Architecture for Clusters (CAC) Workshop, Apr 2006 [Bib - Plain]
409	Benefits of High Speed Interconnects to Cluster File Systems: A Case Study with Lustre W. Yu, R. Noronha, S. Liang, and DK Panda, Communication Architecture for Clusters (CAC) Workshop, Apr 2006 [Bib - Plain]
410	RDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits S. Sur, L. Chai, H. Jin, and DK Panda, International Symposium on Principles and Practice of Parallel Programming (PPoPP 2006), Mar 2006 [Slides] [Bib - Plain]
411	A Case for UDP Offload Engines in LambdaGrids V. Vishwanath, P. Balaji, W. Feng, J. Leigh, and DK Panda, International Workshop on Protocols for Fast Long-Distance Networks (PFLDnet 2006), Feb 2006 [Bib - Plain]
412	High Performance RDMA Based All-to-all Broadcast for InfiniBand Clusters S. Sur, U. Bondhugula, A. Mamidala, H. Jin, and DK Panda, International Conference on High Performance Computing (HiPC 2005), Dec 2005 [Bib - Plain]
413	Supporting MPI-2 One Sided Communication on Multi-Rail InfiniBand Clusters: Design Challenges and Performance Benefits A. Vishnu, G. Santhanaraman, W. Huang, H. Jin, and DK Panda, International Conference on High Performance Computing (HiPC 2005), Dec 2005 [Bib - Plain]
414	Supporting iWARP Compatibility and Features for Regular Network Adapters P. Balaji, H. Jin, K. Vaidyanathan, and DK Panda, Workshop on Remote Direct Memory Access (RDMA): Applications, Implementations, and Technologies, Sep 2005 [Slides] [Bib - Plain]
415	Head-to-TOE Evaluation of High-Performance Sockets over Protocol Offload Engines P. Balaji, W. Feng, Q. Gao, R. Noronha, W. Yu, and DK Panda, IEEE Cluster Computing 2005, Sep 2005 [Slides] [Bib - Plain]
416	Swapping to Remote Memory over InfiniBand: An Approach using a High Performance Network Block Device S. Liang, R. Noronha, and DK Panda, IEEE Cluster Computing 2005, Sep 2005 [Slides] [Bib - Plain]
417	Benefits of Quadrics Scatter/Gather to PVFS2 Noncontiguous I/O W. Yu, and DK Panda, International Workshop on Storage Network Architecture and Parallel I/Os (SNAPI) 2005, Sep 2005 [Slides] [Bib - Plain]
418	Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems? S. Sur, A. Vishnu, H. Jin, W. Huang, and DK Panda, Hot Interconnect 13 (HOTI 05), Aug 2005 [Slides] [Bib - Plain]
419	Performance Characterization of a 10-Gigabit Ethernet TOE W. Feng, P. Balaji, C. Baron, L. N. Bhuyan, and DK Panda, Hot Interconnect 13 (HOTI 05), Aug 2005 [Slides] [Bib - Plain]
420	Performance Evaluation of MM5 on Clusters With Modern Interconnects: Scalability and Impact R. Noronha, and DK Panda, Euro-Par, Aug 2005 [Bib - Plain]
421	Performance Evaluation of RDMA over IP: A Case Study with the Ammasso Gigabit Ethernet NIC H. Jin, S. Narravula, K. Vaidyanathan, P. Balaji, and DK Panda, Workshop on High Performance Interconnects for Distributed Computing (HPI-DC); In conjunction with HPDC-14, Jul 2005 [Bib - Plain]
422	High Performance Support of Parallel Virtual File System (PVFS2) over Quadrics W. Yu, S. Liang, and DK Panda, International Conference on Supercomputing (ICS '05), Jun 2005 [Bib - Plain]
423	LiMIC: Support for High-Performance MPI Intra-Node Communication on Linux Cluster H. Jin, S. Sur, L. Chai, and DK Panda, International Conference on Parallel Processing (ICPP-05), Jun 2005 [Slides] [Bib - Plain]
424	Architecture for Caching Responses with Multiple Dynamic Dependencies in Multi-Tier Data-Centers over InfiniBand S. Narravula, P. Balaji, K. Vaidyanathan, H. Jin, and DK Panda, IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 05), May 2005 [Slides] [Bib - Plain]
425	Can High Performance Software DSM Systems Designed With InfiniBand Features Benefit from PCI-Express? R. Noronha, and DK Panda, DSM Workshop, May 2005 [Bib - Plain]
426	Designing Multi-Level, Multi-Tier Data Center Architecture for Securing Distributed Infrastructure and Assets DK Panda, DHS Homeland Security Conference, Apr 2005 [Bib - Plain]
427	Analysis of Design Considerations for Optimizing Multi-Channel MPI over InfiniBand L. Chai, S. Sur, H. Jin, and DK Panda, Workshop on Communication Architecture on Clusters (CAC '05), Apr 2005 [Bib - Plain]
428	Scheduling of MPI-2 One Sided Operations over InfiniBand W. Huang, G. Santhanaraman, H. Jin, and DK Panda, Workshop on Communication Architecture on Clusters (CAC '05), Apr 2005 [Slides] [Bib - Plain]
429	Performance Modeling of Subnet Management on Fat Tree InfiniBand Networks using OpenSM A. Vishnu, A. Mamidala, and H.- W, Workshop on System Management Tools on Large Scale Parallel Systems, Apr 2005 [Bib - Plain]
430	Design and Implementation of Open MPI over Quadrics/Elan4 W. Yu, T. S. Woodall, R. L. Graham, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS 2005), Apr 2005 [Slides] [Bib - Plain]
431	On the Provision of Prioritization and Soft QoS in Dynamically Reconfigurable Shared Data-Centers over InfiniBand P. Balaji, S. Narravula, K. Vaidyanathan, H. Jin, and DK Panda, IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 05), Mar 2005 [Slides] [Bib - Plain]
432	Workload-driven Analysis of File Systems in Shared Multi-Tier Data-Centers over InfiniBand K. Vaidyanathan, P. Balaji, H. Jin, and DK Panda, Computer Architecture Evaluation using Commercial Workloads (in conjunction with HPCA), Feb 2005 [Slides] [Bib - Plain]
433	Scalable Startup of Parallel Programs over InfiniBand W. Yu, J. Wu, and DK Panda, International Conference on High Performance Computing (HiPC '04), Dec 2004 [Slides] [Bib - Plain]
434	Building Multirail InfiniBand Clusters: MPI-Level Design and Performance Evaluation J. Liu, A. Vishnu, and DK Panda, SuperComputing 2004 Conference (SC 04), Nov 2004 [Slides] [Bib - Plain]
435	Reducing Diff Overhead in Software DSM Systems using RDMA Operations in InfiniBand R. Noronha, and DK Panda, Workshop on Remote Direct Memory Access (RDMA): Applications, Implementations, and Technologies in conjunction with the IEEE Cluster, Sep 2004 [Slides] [Bib - Plain]
436	Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand P. Balaji, K. Vaidyanathan, S. Narravula, K. Savitha, H. Jin, and DK Panda, Workshop on Remote Direct Memory Access (RDMA): Applications, Implementations, and Technologies in conjunction with the IEEE Cluster, Sep 2004 [Slides] [Bib - Plain]
437	Sockets vs RDMA Interface over 10-Gigabit Networks: An In-depth analysis of the Memory Traffic Bottleneck P. Balaji, H. V. Shah, and DK Panda, Workshop on Remote Direct Memory Access (RDMA): Applications, Implementations, and Technologies in conjunction with the IEEE Cluster, Sep 2004 [Slides] [Bib - Plain]
438	Scalable and High Performance NIC-Based Allgather over Myrinet/GM W. Yu, D. Buntinas, and DK Panda, International Conference on Cluster Computing 2004, Sep 2004 [Slides] [Bib - Plain]
439	Efficient Barrier and Allreduce on IBA Clusters using Hardware Multicast and Adaptive Algorithms A. Mamidala, J. Liu, and DK Panda, International Conference on Cluster Computing 2004, Sep 2004 [Bib - Plain]
440	NIC-Based Offload of Dynamic User-Defined Modules for Myrinet Clusters A. Wagner, H. Jin, R. Riesen, and DK Panda, International Conference on Cluster Computing 2004, Sep 2004 [Bib - Plain]
441	Zero-Copy MPI Derived Datatype Communication over InfiniBand G. Santhanaraman, J. Wu, and DK Panda, EuroPVM/MPI 2004, Sep 2004 [Slides] [Bib - Plain]
442	Efficient Implementation of MPI-2 Passive One-Sided Communication on InfiniBand Clusters W. Jiang, J. Liu, H. Jin, DK Panda, D. Buntinas, R. Thakur, and W. Gropp, EuroPVM/MPI 2004, Sep 2004 [Slides] [Bib - Plain]
443	Performance Evaluation of InfiniBand with PCI Express J. Liu, A. Mamidala, A. Vishnu, and DK Panda, Hot Interconnect 12 (HOTI 04), Aug 2004 [Bib - Plain]
444	Efficient and Scalable All-to-All Personalized Exchange for InfiniBand-based Clusters S. Sur, H. Jin, and DK Panda, International Conference on Parallel Processing (ICPP '04), Aug 2004 [Bib - Plain]
445	Design and Implementation of MPICH2 over InfiniBand with RDMA Support J. Liu, W. Jiang, P. Wyckoff, DK Panda, D. Ashton, D. Buntinas, W. Gropp, and B. Toonen, International Parallel and Distributed Processing Symposium (IPDPS 04), Apr 2004 [Slides] [Bib - Plain]
446	Fast and Scalable MPI-Level Broadcast using InfiniBand's Hardware Multicast Support J. Liu, A. Mamidala, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS 04), Apr 2004 [Slides] [Bib - Plain]
447	High Performance Implementation of MPI Datatype Communication over InfiniBand J. Wu, P. Wyckoff, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS 04), Apr 2004 [Bib - Plain]
448	Host-Assisted Zero-Copy Remote Memory Access Communication on InfiniBand V. Tipparaju, G. Santhanaraman, J. Nieplocha, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS 04), Apr 2004 [Bib - Plain]
449	Implementing Efficient and Scalable Flow Control Schemes in MPI over InfiniBand J. Liu, and DK Panda, International Workshop on Communication Architecture for Clusters (CAC 04), Apr 2004 [Slides] [Bib - Plain]
450	Efficient and Scalable Barrier over Quadrics and Myrinet with a New NIC-Based Collective Message Passing Protocol W. Yu, and DK Panda, International Workshop on Communication Architecture for Clusters (CAC 04), Apr 2004 [Slides] [Bib - Plain]
451	High Performance MPI-2 One-Sided Communication over InfiniBand W. Jiang, J. Liu, H. Jin, DK Panda, W. Gropp, and R. Thakur, International Symposium on Cluster Computing and the Grid (CCGrid 04), Apr 2004 [Slides] [Bib - Plain]
452	Unifier: Unifying Cache Management and Communication Buffer Management for PVFS over InfiniBand J. Wu, P. Wyckoff, DK Panda, and R. Ross, International Symposium on Cluster Computing and the Grid (CCGrid 04), Apr 2004 [Bib - Plain]
453	Designing High Performance DSM Systems using InfiniBand Features R. Noronha, and DK Panda, International Workshop on Distributed Shared Memory Systems, Apr 2004 [Slides] [Bib - Plain]
454	Sockets Direct Protocol over InfiniBand in Clusters: Is it Beneficial? P. Balaji, S. Narravula, K. Vaidyanathan, S. Krishnamoorthy, J. Wu, and DK Panda, International Symposium on Performance Analysis of Systems and Software, Apr 2004 [Bib - Plain]
455	Sockets Direct Protocol over InfiniBand in Clusters: Is it Beneficial? P. Balaji, S. Narravula, K. Vaidyanathan, S. Krishnamoorthy, J. Wu, and DK Panda, IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 04), Apr 2004 [Slides] [Bib - Plain]
456	Supporting Strong Coherency for Active Caches in Multi-Tier Data-Centers over InfiniBand S. Narravula, P. Balaji, K. Vaidyanathan, S. Krishnamoorthy, J. Wu, and DK Panda, SAN-03 Workshop (in conjunction with HPCA), Feb 2004 [Slides] [Bib - Plain]
457	Evaluating the Impact of RDMA on Storage I/O over InfiniBand J. Liu, DK Panda, and M. Banikazemi, SAN-03 Workshop (in conjunction with HPCA), Feb 2004 [Slides] [Bib - Plain]
458	Application-Bypass Reduction for Large-Scale Clusters A. Wagner, D. Buntinas, R. Brightwell, and DK Panda, Cluster 2003 Conference, Dec 2003 [Bib - Plain]
459	Supporting Efficient Noncontiguous Access in PVFS over InfiniBand J. Wu, P. Wyckoff, and DK Panda, Cluster 2003 Conference, Dec 2003 [Bib - Plain]
460	Optimizing Mechanisms for Latency Tolerance in Remote Memory Access Communication V. Tipparaju, M. Krishnan, J. Nieplocha, G. Santhanaraman, and DK Panda, Cluster 2003 Conference, Dec 2003 [Bib - Plain]
461	Performance Comparison of MPI Implementations over InfiniBand, Myrinet and Quadrics J. Liu, B. Chandrasekaran, J. Wu, W. Jiang, S. Kini, W. Yu, D. Buntinas, P. Wyckoff, and DK Panda, SuperComputing 2003, Nov 2003 [Bib - Plain]
462	Scalable NIC-based Reduction on Large-scale Clusters A. Moody, J. Fernandez, F. Petrini, and DK Panda, SuperComputing 2003, Nov 2003 [Bib - Plain]
463	High Performance Broadcast Support in LA-MPI over Quadrics W. Yu, S. Sur, DK Panda, R. T. Aulwes, and R. Graham, Los Alamos Computer Science Institute (LACSI) Symposium, Oct 2003 [Slides] [Bib - Plain]
464	High Performance and Reliable NIC-Based Multicast over Myrinet/GM-2 W. Yu, D. Buntinas, and DK Panda, International Conference on Parallel Processing, Oct 2003 [Slides] [Bib - Plain]
465	PVFS over InfiniBand: Design and Performance Evaluation J. Wu, P. Wyckoff, and DK Panda, International Conference on Parallel Processing, Oct 2003 [Bib - Plain]
466	Designing a Portable MPI-2 over Modern Interconnects using uDAPL Interface L. Chai, R. Noronha, P. Gupta, G. Brown, and DK Panda, Euro PVM/MPI Conference, Sep 2003 [Bib - Plain]
467	Efficient Hardware Multicast Group Management for Multiple MPI Communicators over InfiniBand A. Mamidala, H. Jin, and DK Panda, Euro PVM/MPI Conference, Sep 2003 [Slides] [Bib - Plain]
468	Design Alternatives and Performance Trade-offs for Implementing MPI-2 over InfiniBand W. Huang, G. Santhanaraman, H. Jin, and DK Panda, Euro PVM/MPI Conference, Sep 2003 [Slides] [Bib - Plain]
469	Fast and Scalable Barrier using RDMA and Multicast Mechanisms for InfiniBand-Based Clusters S. Kini, J. Liu, J. Wu, P. Wyckoff, and DK Panda, Euro PVM/MPI Conference, Sep 2003 [Bib - Plain]
470	Demotion-Based Exclusive Caching through Demote Buffering: Design and Evaluations over Different Networks J. Wu, P. Wyckoff, and DK Panda, Workshop on Storage Network Architecture and Parallel I/O (SNAPI), Sep 2003 [Bib - Plain]
471	MIBA: A Micro-benchmark Suite for Evaluating InfiniBand Architecture Implementations B. Chandrasekaran, P. Wyckoff, and DK Panda, Performance TOOLS 2003, Sep 2003 [Bib - Plain]
472	Micro-Benchmark Level Performance Comparison of High-Speed Cluster Interconnects J. Liu, B. Chandrasekaran, W. Yu, J. Wu, D. Buntinas, S. P. Kini, P. Wyckoff, and DK Panda, Hot Interconnects 10, Aug 2003 [Bib - Plain]
473	High Performance RDMA-Based MPI Implementation over InfiniBand J. Liu, J. Wu, S. Kini, P. Wyckoff, and DK Panda, International Conference on Supercomputing (ICS '03), Jun 2003 [Bib - Plain]
474	QoS-aware Middleware for Cluster-based Servers to Support Interactive and Resource-Adaptive Applications S. Senapathi, B. Chandrasekaran, D. Stredney, H.-W. Shen, and DK Panda, High Performance Distributed Computing, Jun 2003 [Bib - Plain]
475	Impact of High Performance Sockets on Data Intensive Applications P. Balaji, J. Wu, T. Kurc, U. Catalyurek, DK Panda, and J. Saltz, High Performance Distributed Computing, Jun 2003 [Bib - Plain]
476	Application-Bypass Broadcast in MPICH over GM D. Buntinas, DK Panda, and R. Brightwell, Cluster Computing and Grid (CCGrid '03), May 2003 [Bib - Plain]
477	Optimizing Barrier and Lock Operations in ARMCI D. Buntinas, A. Saify, DK Panda, and J. Nieplocha, International Workshop on Communication Architecture for Clusters (CAC '03), Apr 2003 [Bib - Plain]
478	Efficient Collective Operations using Remote Memory Operations on VIA-Based Clusters R. Gupta, P. Balaji, DK Panda, and J. Nieplocha, International Parallel and Distributed Processing Symposium (IPDPS '03), Apr 2003 [Bib - Plain]
479	NIC-Based Reduction in Myrinet Clusters: Is It Beneficial? D. Buntinas, and DK Panda, SAN-02 Workshop (in conjunction with HPCA), Apr 2003 [Bib - Plain]
480	A Portable Client/Server Communication Middleware over SANs: Design and Performance Evaluation with InfiniBand J. Liu, M. Banikazemi, B. Abali, and DK Panda, SAN-02 Workshop (in conjunction with HPCA), Apr 2003 [Bib - Plain]
481	Impact of On-Demand Connection Management in MPI over VIA J. Wu, J. Liu, P. Wyckoff, and DK Panda, Cluster '02, Sep 2002 [Bib - Plain]
482	Efficient Barrier using Remote Memory Operations on VIA-Based Clusters R. Gupta, V. Tipparaju, J. Nieplocha, and DK Panda, Cluster '02, Sep 2002 [Bib - Plain]
483	High Performance User-Level Sockets over Gigabit Ethernet P. Balaji, P. Shivam, P. Wyckoff, and DK Panda, Cluster '02, Sep 2002 [Bib - Plain]
484	A QoS Framework for Clusters to support Applications with Resource Adaptivity and Predictable Performance S. Senapathi, DK Panda, D. Stredney, and H.-W. Shen, International Workshop on Quality of Service (IWQoS), May 2002 [Bib - Plain]
485	Can User Level Protocols Take Advantage of Multi-CPU NICs? P. Shivam, P. Wyckoff, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS '02), Apr 2002 [Bib - Plain]
486	MPI/IO on DAFS Over VIA: Implementation and Performance Evaluation J. Wu, and DK Panda, Communication Architecture for Clusters (CAC'02) Workshop, Apr 2002 [Bib - Plain]
487	Protocols and Strategies for Optimizing Remote Memory Operations on Clusters J. Nieplocha, V. Tipparaju, A. Saify, and DK Panda, Communication Architecture for Clusters (CAC'02) Workshop, held in conjunction with IPDPS '02, Apr 2002 [Bib - Plain]
488	NIC-Based Atomic Operations on Myrinet/GM D. Buntinas, DK Panda, and W. Gropp, SAN-1 Workshop, Feb 2002 [Bib - Plain]
489	EMP: Zero-copy OS-bypass NIC-driven Gigabit Ethernet Message Passing P. Shivam, P. Wyckoff, and DK Panda, Supercomputing '01, Feb 2002 [Bib - Plain]
490	Implementing TreadMarks over GM on Myrinet: Challenges, Design Experiences and Performance Evaluation R. Noronha, and DK Panda, The Workshop on Communication Architecture for Clusters held in conjunction with IPDPS 2003, Sep 2001 [Slides] [Bib - Plain]
491	Implementing TreadMarks over VIA on Myrinet and Gigabit Ethernet: Challenges, Design Experience, and Performance Evaluation M. Banikazemi, J. Liu, DK Panda, and P. Sadayappan, International Conference on Parallel Processing 2001, Sep 2001 [Bib - Plain]
492	NIC-based Rate Control for Proportional Bandwidth Allocation in Myrinet Clusters A. Gulati, DK Panda, P. Sadayappan, and P. Wyckoff, International Conference on Parallel Processing 2001, Sep 2001 [Bib - Plain]
493	Performance Benefits of NIC-Based Barrier on Myrinet/GM D. Buntinas, DK Panda, and P. Sadayappan, Workshop on Communication Architecture for Clusters (CAC '01), Apr 2001 [Bib - Plain]
494	Fast NIC-Based Barrier over Myrinet/GM D. Buntinas, DK Panda, and P. Sadayappan, International Parallel and Distributed Processing Symposium, Apr 2001 [Bib - Plain]
495	Can Scatter Communication Take Advantage of Multidestination Message Passing? M. Banikazemi, and DK Panda, International Symposium on High Performance Computing (HiPC '00), Dec 2000 [Bib - Plain]
496	Characterization and Enhancement of Static Mapping Heuristics for Heterogeneous Systems P. Holenarsipur, V. Yarmolenko, J. Duato, DK Panda, and P. Sadayappan, International Symposium on High Performance Computing (HiPC '00), Dec 2000 [Bib - Plain]
497	Dynamic Mapping Heuristics in Heterogeneous Systems V. Yarmolenko, J. Duato, DK Panda, and P. Sadayappan, Workshop on Network-Based Computing, Aug 2000 [Bib - Plain]
498	Balancing Web Server Load for Adaptive Video Distribution A. Paul, W.-C. Feng, DK Panda, and P. Sadayappan, Workshop on Multimedia Computing, Aug 2000 [Bib - Plain]
499	Implementing TreadMarks on Virtual Interface Architecture (VIA): Design Issues and Alternatives M. Banikazemi, DK Panda, and P. Sadayappan, Ninth Workshop on Scalable Shared Memory Multiprocessors, Jun 2000 [Bib - Plain]
500	TupleQ: Fully-Asynchronous and Zero-Copy MPI over InfiniBand M. Koop, J. Sridhar, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS), May 2000 [Slides] [Bib - Plain]
501	MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand M. Koop, T. Jones, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS), May 2000 [Slides] [Bib - Plain]
502	Designing Passive Synchronization for MPI-2 One-Sided Communication to Maximize Overlap G. Santhanaraman, S. Narravula, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS), May 2000 [Bib - Plain]
503	VIBe: A Micro-benchmark Suite for Evaluating Virtual Interface Architecture (VIA) Implementations M. Banikazemi, J. Liu, S. Kutlug, A. Ramakrishna, P. Sadayappan, H. Sah, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS), May 2000 [Bib - Plain]
504	Efficient Multicast Algorithms for Heterogeneous Switch-based Irregular Networks of Workstations A. Singhal, M. Banikazemi, P. Sadayappan, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS), May 2000 [Bib - Plain]
505	Efficient Virtual Interface Architecture Support for the IBM SP Switch-Connected NT Clusters M. Banikazemi, V. Moorthy, L. Herger, DK Panda, and B. Abali, International Parallel and Distributed Processing Symposium (IPDPS), May 2000 [Bib - Plain]
506	Adaptive Routing in RS/6000 SP-like Bidirectional Multistage Interconnection Networks M. Banikazemi, C. B. Stunkel, DK Panda, and B. Abali, International Parallel and Distributed Processing Symposium (IPDPS), May 2000 [Bib - Plain]
507	Comparison and Evaluation of Design Choices for Implementing the Virtual Interface Architecture (VIA) M. Banikazemi, B. Abali, and DK Panda, Fourth International Workshop on Communication and Architectural Support for Network-Based Parallel Computing (CANPC'00), Jan 2000 [Bib - Plain]
508	Broadcast/Multicast over Myrinet Using NIC-Assisted Multidestination Messages D. Buntinas, DK Panda, J. Duato, and P. Sadayappan, Fourth International Workshop on Communication and Architectural Support for Network-Based Parallel Computing (CANPC'00), Jan 2000 [Bib - Plain]
509	Fast Collective Communication Algorithms for Reflective Memory Network Clusters V. Moorthy, DK Panda, and P. Sadayappan, Fourth International Workshop on Communication and Architectural Support for Network-Based Parallel Computing (CANPC'00), Jan 2000 [Bib - Plain]
510	Implementing Efficient MPI on LAPI for the IBM-SP: Experiences and Performance Evaluation M. Banikazemi, R. Govindaraju, R. Blackmore, and DK Panda, International Parallel Processing Symposium (IPPS'99), Jan 2000 [Bib - Plain]
511	Low Latency Message Passing on Workstation Clusters using SCRAMNet V. Moorthy, M. Jacunski, M. Pillai, P. Ware, DK Panda, T. Page, P. Sadayappan, V. Nagarajan, and J. Daniel, International Parallel Processing Symposium (IPPS'99), Jan 2000 [Bib - Plain]
512	Communication Modeling of Heterogeneous Networks of Workstations for Performance Characterization of Collective Operations M. Banikazemi, S. Prabhu, J. Sampathkumar, DK Panda, and P. Sadayappan, International Workshop on Heterogeneous Computing (HCW'99), Jan 2000 [Bib - Plain]
513	All-to-All Broadcast on Switch-Based Clusters of Workstations M. Jacunski, P. Sadayappan, and DK Panda, International Parallel Processing Symposium 1999, Apr 1999 [Bib - Plain]
514	Low Latency Message-Passing for Reflective Memory Networks M. Jacunski, V. Moorthy, P. Ware, M. Pillai, DK Panda, and P. Sadayappan, International Workshop on Communication, Jan 1999 [Bib - Plain]
515	Where to Provide Support for Efficient Multicasting in Irregular Networks: Network Interface or Switch? R. Sivaram, R. Kesavan, DK Panda, and Craig B. Stunkel, International Conference on Parallel Processing, Aug 1998 [pp. 452-459] [Bib - Plain]
516	Experiences with Software MPEG-2 Video Decompression on an SMP PC A. Bala, D. Shah, W.-C. Feng, and DK Panda, ICPP Workshop, Aug 1998 [Bib - Plain]
517	HIPIQS: A High-Performance Switch Architecture using Input Queuing R. Sivaram, C. Stunkel, and DK Panda, International Parallel Processing Symposium (IPPS '98), Aug 1998 [Bib - Plain]
518	Prioritized Demand Multiplexing (PDM): A Low-Latency Virtual Channel Flow Control Framework for Prioritized Traffic A-H. Smai, DK Panda, and L-E. Thorelli, International Conference on High Performance Computing, Dec 1997 [Bib - Plain]
519	How Much Does Network Contention Affect Distributed Shared Memory Performance? D. Dai, and DK Panda, International Conference on Parallel Processing 1997, Dec 1997 [pp. 454-461] [Bib - Plain]
520	Optimal Multicast with Packetization and Network Interface Support R. Kesavan, and DK Panda, International Conference on Parallel Processing (ICPP'97), Dec 1997 [pp. 370-377] [Bib - Plain]
521	Multicasting on Switch-based Irregular Networks using Multi-drop Path-based Multidestination Worms R. Kesavan, and DK Panda, Parallel Computing, Routing, and Communication Workshop, Dec 1997 [Bib - Plain]
522	Multicasting in Irregular Networks with Cut-Through Switches using Tree-Based Multidestination Worms R. Sivaram, DK Panda, and C. B. Stunkel, Parallel Computing, Routing, and Communication Workshop, Dec 1997 [Bib - Plain]
523	How Can We Design Better Networks for DSM Systems? D. Dai, and DK Panda, Parallel Computing, Routing, and Communication Workshop, Dec 1997 [Bib - Plain]
524	Implementing Multidestination Worms in Switch-Based Parallel Systems: Architectural Alternatives and their Impact C. B. Stunkel, R. Sivaram, and DK Panda, International Symposium on Computer Architecture (ISCA'97), Jun 1997 [Bib - Plain]
525	A Reliable Hardware Barrier Synchronization Scheme R. Sivaram, C. B. Stunkel, and DK Panda, International Parallel Processing Symposium (IPPS'97), Apr 1997 [Bib - Plain]
526	Efficient Collective Communication on Heterogeneous Networks of Workstations M. Banikazemi, V. Moorthy, and DK Panda, International Conference on Parallel Processing, Aug 1996 [Bib - Plain]
527	Impact of Adaptivity on the Behavior of Networks of Workstations under Bursty Traffic F. Silla, M. P. Malumbres, J. Duato, D. Dai, and DK Panda, International Conference on Parallel Processing, Aug 1996 [Bib - Plain]
528	Designing Processor-cluster Based Systems: Interplay Between Cluster Organizations and Collective Communication Algorithms D. Basak, and DK Panda, International Conference on Parallel Processing, Aug 1996 [Bib - Plain]
529	Reducing Cache Invalidation Overheads in Wormhole DSMs using Multidestination Message Passing D. Dai, and DK Panda, International Conference on Parallel Processing, Aug 1996 [Bib - Plain]
530	Minimizing Node Contention in Multiple Multicast on Wormhole k-ary n-cube Networks R. Kesavan, and DK Panda, International Conference on Parallel Processing, Aug 1996 [Bib - Plain]
531	Hybrid Algorithms for Complete Exchange in 2D Meshes N. S. Sundar, D. N. Jayasimha, DK Panda, and P. Sadayappan, Proceedings of the International Conference on Supercomputing, May 1996 [Bib - Plain]
532	Multicast on Irregular Switch-based Networks with Wormhole Routing R. Kesavan, K. Bondalapati, and DK Panda, Proceedings of the Third International Symposium on High Performance Computer Architecture (HPCA-3), Feb 1996 [Bib - Plain]
533	Fast Barrier Synchronization in Wormhole k-ary n-cube Networks with Multidestination Worms DK Panda, International Symposium on High Performance Computer Architecture, Jan 1995 [Bib - Plain]
534	Issues in Designing Scalable Systems with k-ary n-cube cluster-c organization DK Panda, and D. Basak, International Workshop on Parallel Processing, Dec 1994 [Bib - Plain]
535	Architectural Issues in Designing Heterogeneous Parallel Systems with Passive Star-Coupled Optical Interconnection R. Prakash, and DK Panda, International Symposium on Parallel Architectures, Dec 1994 [Bib - Plain]
536	Designing Large Hierarchical Multiprocessor Systems under Processor D. Basak, and DK Panda, International Parallel Processing Conference (ICPP '94), Aug 1994 [Bib - Plain]
537	Message-Ordering for Wormhole-Routed Multiport Systems with Link Contention and Routing Adaptivity DK Panda, and V. Dixit-Radiya, Scalable High Performance Computing Conference, May 1994 [Bib - Plain]
538	Complete Exchange in 2D Meshes N. S. Sundar, D. N. Jayasimha, DK Panda, and P. Sadayappan, Scalable High Performance Computing Conference, May 1994 [Bib - Plain]
539	Multidestination Message Passing Mechanism Conforming to Base Wormhole Routing Scheme DK Panda, S. Singal, and P. Prabhakaran, Parallel Routing and Communication Workshop, May 1994 [Bib - Plain]
540	Scalable Architecture with k-ary n-cube cluster-c Organizations D. Basak, and DK Panda, Symposium on Parallel and Distributed Processing, Dec 1993 [Bib - Plain]
541	Task Assignment in Distributed-Memory Systems with Adaptive Wormhole Routing V. Dixit-Radiya, and DK Panda, Symposium on Parallel and Distributed Processing, Dec 1993 [Bib - Plain]
542	Optimal Phase Barrier Synchronization in k-ary n-cube Wormhole-routed Systems using Multirendezvous Primitives DK Panda, Workshop on Fine-Grain Massively Parallel Coordination, May 1993 [Bib - Plain]
543	Analysis of Routing in Pyramid Architectures T. Mzaik, S. Chandra, J. M. Jagadeesh, and DK Panda, IEEE National Aerospace and Electronics Conference (NAECON), May 1993 [Bib - Plain]
544	Benefits of Processor Clustering in Designing Large Parallel Systems: When and How? D. Basak, DK Panda, and M. Banikazemi, International Parallel Processing Symposium, Apr 1993 [Bib - Plain]
545	Global Reduction in Wormhole k-ary n-cube Networks with Multidestination Exchange Worms DK Panda, International Parallel Processing Symposium, Apr 1993 [Bib - Plain]
546	An Efficient Scheme for Complete Exchange in 2D Tori Y.-C. Tseng, S. K. S. Gupta, and DK Panda, International Parallel Processing Symposium, Apr 1993 [Bib - Plain]
547	Clustering and Intra-Processor Scheduling for Explicitly-Parallel Programs on Distributed-Memory Systems V. Dixit-Radiya, and DK Panda, International Parallel Processing Symposium, Apr 1993 [Bib - Plain]
548	Impact of Multiple Consumption Channels on Wormhole Routed k-ary n-cube Networks S. Balakrishnan, and DK Panda, International Parallel Processing Symposium, Apr 1993 [Bib - Plain]
549	Barrier Synchronization in Distributed-Memory Multiprocessors using Rendezvous Primitives S. K. S. Gupta, and DK Panda, International Parallel Processing Symposium, Apr 1993 [Bib - Plain]
550	A Trip-based Multicasting Model for Wormhole-routed Networks with Virtual Channels Y. C. Tseng, and DK Panda, International Parallel Processing Symposium, Apr 1993 [Bib - Plain]

Technical Reports (8)
1	K. Vaidyanathan, P. Lai, S. Narravula, and DK Panda, Benefits of Dedicating Resource Sharing Services in Data-Centers for Emerging Multi-Core Systems, OSU-CISRC-8/07-TR53
2	K. Vaidyanathan, H. Jin, S. Narravula, and DK Panda, Accurate Load Monitoring for Cluster-based Web Data-Centers over RDMA-enabled Networks, OSU-CISRC-7/05-TR49
3	G. Marsh, A. Sampat, S. Potluri, and DK Panda, Scaling Advanced Message Queuing Protocol (AMQP) Architecture with Broker Federation and InfiniBand, OSU-CISRC-5/09-TR17
4	W. Huang, J. Liu, B. Abali, and DK Panda, InfiniBand Support in Xen Virtual Machine Environment, OSU-CISRC-2/06-TR18
5	P. Balaji, W. Feng, and DK Panda, The Convergence of Ethernet and Ethernot: A 10-Gigabit Ethernet Perspective, OSU-CISRC-1/06-TR10
6	H. Jin, S. Narravula, G. Brown, K. Vaidyanathan, P. Balaji, and DK Panda, Performance Evaluation of RDMA over IP: A Case Study with Ammasso Gigabit Ethernet NIC, OSU-CISRC-6/05-TR40
7	K. Vaidyanathan, P. Balaji, J. Wu, H. Jin, and DK Panda, An Architectural Study of Cluster-Based Multi-Tier Data-Centers,
8	S. Krishnamoorthy, P. Balaji, K. Vaidyanathan, H. Jin, and DK Panda, Dynamic Reconfigurability Support for providing Soft QoS Guarantees in Cluster-based Multi-Tier Data-Centers over InfiniBand,

Ph.D. Dissertations (45)
1	B. Ramesh, Designing High-Performance Architecture-aware Communication Middleware for Modern HPC Systems, May 2025
2	K. Suresh, Designing High-Performance Middleware Utilizing Smart Network Features for HPC Applications, Apr 2025
3	Q. Zhou, High Performance Communication Middleware with On-the-fly GPU-based Compression for HPC and Deep Learning Applications, Jul 2024
4	P. Kousha, Designing Conversational AI Enabled Communication Analysis and Profiling Tool for High-Performance Computing, Apr 2024
5	K. Khorassani, High-Performance, Adaptive, and Scalable GPU-aware MPI Libraries for Next-Generation Heterogeneous Systems, May 2023
6	A. Jain, Novel Parallelization Strategies for High-Performance DNN Training on HPC System, Dec 2022
7	M. Bayatpour, Designing High Performance Hardware-assisted Communication Middlewares for Next-Generation HPC Systems, May 2021
8	C. Chu, Accelerator-enabled Communication Middleware for Large-scale Heterogeneous HPC Systems with Modern Interconnects, Jul 2020
9	J. Hashmi, Designing High Performance Shared-Address-Space and Adaptive Communication Middlewares for Next-Generation HPC Systems, Apr 2020
10	Ammar Awan, Co-designing Communication Middleware and Deep Learning Frameworks for High-Performance DNN Training on HPC Systems, Apr 2020
11	D. Shankar, Designing Fast, Resilient and Heterogeneity-Aware Key-Value Storage for Modern HPC Clusters, Jul 2019
12	S. Chakraborty, High Performance and Scalable Cooperative Communication Middleware for Next Generation Architectures, Jun 2019
13	J. Zhang, Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters, Jul 2018
14	M. Li, Designing High-Performance Remote Memory Access for MPI and PGAS Models with Modern Networking Technologies on Heterogeneous Clusters, Nov 2017
15	A. Venkatesh, High-Performance Heterogeneity/Energy-Aware Communication for MultiPetaflop HPC Systems, Dec 2016
16	N. Islam, High-Performance File System and I/O Middleware Design for Big Data on HPC Clusters, Nov 2016
17	M. W. Rahman, Designing and Modeling High-Performance MapReduce and DAG Execution Framework on Modern HPC Systems, Nov 2016
18	R. Rajachandrasekar, Designing Scalable And Efficient I/O Middleware for Fault-Resilient High-performance Computing Clusters, Nov 2014
19	J. Jose, Designing High Performance and Scalable Unified Communication Runtime (UCR) for HPC and Big Data Middleware, Aug 2014
20	S. Potluri, Enabling Efficient Use of MPI and PGAS Programming Models on Heterogeneous Clusters with High Performance Interconnects, May 2014
21	K. Kandalla, High Performance Non-Blocking Collective Communication for Next Generation InfiniBand Clusters, Jul 2013
22	M. Luo, Designing Efficient MPI and UPC Runtime for Multicore Clusters with InfiniBand and Heterogeneous System, Jul 2013
23	H. Subramoni, Topology-Aware MPI communication and Scheduling for High Performance Computing Systems, Jul 2013
24	X. Ouyang, Efficient Storage Middleware Design in InfiniBand Clusters for High-End Computing, Mar 2012
25	G. Santhanaraman, Designing Scalable And High Performance One Sided Communication Middleware For Modern Interconnects, Jun 2009
26	M. Koop, High-Performance Multi-Transport MPI Design For Ultra-Scale InfiniBand Clusters, Jun 2009
27	L. Chai, High Performance And Scalable MPI Intra-Node Communication Middleware For Multi-Core Clusters, Mar 2009
28	W. Huang, High Performance Network I/O In Virtual Machines Over Modern Interconnects, Aug 2008
29	R. Noronha, Designing High-Performance and Scalable Clustered Network Attached Storage With InfiniBand, Aug 2008
30	S. Narravula, Designing High-Performance and Scalable Distributed Datacenter Services over Modern Interconnects, Aug 2008
31	A. Mamidala, Scalable and High Performance Collective Communication For Next Generation Multicore InfiniBand Clusters, May 2008
32	K. Vaidyanathan, High Performance and Scalable Soft Shared State for Next-Generation Datacenters, May 2008
33	A. Vishnu, High Performance and Network Fault Tolerant MPI with Multi-Pathing Over InfiniBand, Dec 2007
34	S. Sur, Scalable and High Performance MPI Design for Very Large InfiniBand Clusters, Aug 2007
35	P. Balaji, High Performance Communication Support for Sockets Based Applications over High-Speed Networks, Jun 2006
36	W. Yu, Enhancing MPI with Modern Networking Mechanisms in Cluster Interconnects, Jun 2006
37	J. Wu, Communication and Memory Management in Networked Storage Systems, Sep 2004
38	J. Liu, Designing High Performance and Scalable MPI over InfiniBand, Sep 2004
39	D. Buntinas, Improving Cluster Performance through the Use of Programmable Network Interfaces, Jun 2003
40	M. Banikazemi, Design and Implementation of High Performance Communication Subsystems for Clusters, Dec 2000
41	D. Dai, Designing Efficient Communication Subsystems for Distributed Shared Memory (DSM) Systems, Mar 1999
42	R. Kesavan, Communication Mechanisms and Algorithms for Supporting Scalable Collective Communication on Parallel Systems, Oct 1998
43	R. Sivaram, Architectural Support for Efficient Communication in Scalable Parallel Systems, Aug 1998
44	D. Basak, Designing High Performance Parallel Systems: A Processor-Cluster Based Approach, Jul 1996
45	V. Dixit-Radiya, Mapping on Wormhole-routed Distributed-Memory Systems: A Temporal Communication Graph-based Approach, Mar 1995

M.S. Thesis (38)
1	M. Han, PML-MPI: A Pre-Trained ML Framework for Efficient Collective Algorithm Selection in MPI, May 2024
2	R. Gulhane, Accelerated and Memory-Efficient Distributed Deep Learning: Leveraging Hybrid Parallelism, Quantization, and Mix-Match Runtime Communication, Apr 2024
3	A. Paniraja Guptha, Enhancing OSU Micro-Benchmarks to be an All-In-One Solution for MPI Benchmarking, Jul 2023
4	K. Al Attar, Optimizing Apache Spark using the MVAPICH2 MPI library for High Performance Computing, May 2023
5	N. Sarkauskas, Large-Message Nonblocking MPI Iallgather and MPI Ibcast Offload via BlueField2 DPU, May 2022
6	S. Srivastava, MVAPICH2-AutoTune: An Automatic Collective Tuning Framework for the MVAPICH2 MPI Library, May 2021
7	N. Senthil Kumar, Designing Optimized MPI+NCCL Hybrid Collective Communication Routines for Dense Many-GPU Clusters, May 2021
8	K. Raj, Profiling MPI Primitives in Real-time Using OSU INAM, Apr 2020
9	R. Biswas, Benchmarking and Accelerating TensorFlow-based Deep Learning on Modern HPC Systems, Jul 2018
10	K. Kulkarni, Performance Characterization and Improvements of SQL-on-Hadoop Systems, Aug 2016
11	A. Augustine, Designing a Scalable Network Analysis and Monitoring Tool with MPI Support, Aug 2016
12	A. Bhat, RDMA-based Plugin Design and Profiler for Apache and Enterprise Hadoop Distributed Filesystem, Aug 2015
13	V. Dhanraj, Enhancement of LIMIC-Based Collectives for Multi-core Clusters, Aug 2012
14	A. Singh, Optimizing All-to-all and Allgather Communications on GPGPU Clusters, Apr 2012
15	S. Pai Raikar, Network Fault-Resilient MPI for Multi-Rail InfiniBand Clusters, Dec 2011
16	N. Dandapanthula, InfiniBand Network Analysis and Monitoring using OpenSM, Aug 2011
17	V. Meshram, Distributed Metadata Management for Parallel Systems, Aug 2011
18	G. Marsh, Evaluation of High Performance Financial Messaging on Modern Multi-core Systems, Mar 2010
19	K. Gopalakrishnan, Enhancing Fault Tolerance in MPI for Modern InfiniBand Clusters, Aug 2009
20	T. Gangadharappa, Designing Support For MPI-2 Programming Interfaces On Modern Interconnects, Jun 2009
21	J. Sridhar, Scalable Job Startup And Inter-Node Communication In Multi-Core InfiniBand Clusters, Jun 2009
22	R. Kumar, Enhancing MPI Point-to-Point and Collectives for Clusters with Onloaded/Offloaded InfiniBand Adapters, Aug 2008
23	S. Bhagvat, Designing and Enhancing the Sockets Direct Protocol (SDP) over iWARP and InfiniBand, Aug 2006
24	S. Krishnamoorthy, Dynamic Re-Configurability Support to Provide Soft QoS Guarantees in Cluster-Based Multi-Tier Data-Centers over InfiniBand, Jun 2004
25	W. Jiang, High Performance MPICH2 One-Sided Communication Implementation over InfiniBand, Jun 2004
26	A. Wagner, Static and Dynamic Processing Offload on Myrinet Clusters with Programmable NIC Support, Jun 2004
27	A. Moody, NIC-based Reduction on Large-Scale Quadrics Clusters, Dec 2003
28	B. Chandrasekharan, Micro-benchmark Level Performance Evaluation and Comparison of High Speed Cluster Interconnects, Sep 2003
29	S. Kini, Efficient Collective Communication using Multicast and RDMA Operations for InfiniBand-based Clusters, Jun 2003
30	S. Senapathi, QoS-Aware Middleware to Support Interactive and Resource Adaptive Applications on Myrinet Clusters, Sep 2002
31	P. Shivam, High Performance User Level Protocol on Gigabit Ethernet, Aug 2002
32	R. Gupta, Efficient Collective Communication using Remote Memory Operations on VIA-Based Clusters, Aug 2002
33	A. Saify, Optimizing Collective Communication Operations in ARMCI, Jul 2002
34	S. Desai, Mechanisms for Implementing Efficient Collective Communication in Clusters with Application Bypass, Jun 2002
35	V. Tipparaju, Optimizing ARMCI Get/Put Operations on Myrinet/GM, Sep 2001
36	A. Gulati, A Proportional Bandwidth Allocation Scheme for Myrinet Clusters, Jun 2001
37	V. Kota, Designing Efficient Inter-Cluster Communication Layer for Distributed Computing, Jun 2001
38	S. Kutlug, Performance Evaluation and Analysis of User Level Networking Protocols in Clusters, Jun 2000

B.S. Thesis (2)
1	L. Xu, Scalable Neural Network Architecture Search Applied to Super-Resolution Networks, May 2022
2	N. Sarkauskas, Framework for End-to-End Tuning and Regression for a High Performance MPI Library on Modern Supercomputers, May 2021

NOWLAB: Network Based Computing Lab

This page lists the publications from the NOWLAB members

Conferences & Workshops (550)

HAT-MPI: Hierarchical Auto Tuning of MPI Inter-Node Communication on InfiniBand Clusters

Design and Implementation of Multi-Rail-Aware Hierarchical MPI Reduce-Scatter and Allgather Operations

Understanding Buffer Allocation and Data Transfer Mechanisms on AMD MI300A APUs

NIMBLE: Node-Interconnect Multi-Path Balancing with On-the-fly Orchestration for High Bandwidth GPU Clusters

Multi-Channel DMA-Accelerated MPI Intra-Node Communication: A Hybrid Adaptive Framework with Memory Copy Offloading

Design and Implementation of Casting Compression for GPU-Aware MPI Collectives

From Skew to Symmetry: Node-Interconnect Multi-Path Balancing with Execution-time Planning for Modern GPU Clusters

One Memory-Many Paths: Early Experiences with Allocation and Data Copy Strategies on MI300A

MAC-Attention: a Match-Amend-Complete Scheme for Fast and Accurate Attention Computation

Design and Optimization of GPU-Aware MPI Allreduce Using Direct Sendrecv Communication

HyperSack: Resource-Aware Distributed Hyperparameter Optimization for Lightweight Vision and Language Models on NVIDIA GPU Systems

Supporting Ultra-High-Resolution Digital Agriculture Tasks with Fully Synthetic Curriculum Learning

Performance Characterization of Data Transfer and Allocation Strategies on AMD MI300A APUs: Early Experiences

Enhanced MPI Intra-node Communication Framework: A Hybrid Approach with Cooperative DMA Channel-based Data Transfer

A Streaming Collectives Interface Targeting Dataflow Acceleration and HPC Workloads

OpenSHMEM MLIR: A Dialect for Compile-Time Optimization of One-Sided Communications

MPI Communication Performance on AMD MI300A: Microbenchmarks and Applications

HARVEST Inference: Characterizing Digital Agriculture Workloads across Compute Continuum

Towards Dynamic Message Passing Protocols for Stencil-Based Communication Patterns

Characterizing Communication Patterns in Distributed Large Language Model Inference

OMB-Compr: An Extension to OSU Micro Benchmarks for Collective Compression Error Measurement

OpenSHMEM Performance on Bluefield-3 Data Processing Units (DPUs)

Use of BlueField-SmartNICs in Offloading One-Sided Communication Primitives

Design and Implementation of MPI Collective Operations for Large Message Communication on AMD GPUs

Design and Implementation of a GPU-Aware MPI Collective Library for Intel GPUs

Unified Designs of Multi-rail-aware MPI Allreduce and Alltoall Operations Across Diverse GPU and Interconnect Systems

Training ultra long context language model with fully pipelined distributed transformer

Effective and Efficient Offloading Designs for One-Sided Communication to SmartNICs

Using BlueField-3 SmartNICs to Offload Vector Operations in Krylov Subspace Methods

Design and Implementation of Kernel-based MPI Reduction Operations for Intel GPUs

Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning

HyperSack: Distributed Hyperparameter Optimization for Deep Learning using Resource-Aware Scheduling on Heterogeneous GPU Systems

HARVEST-2.0: High-Performance Vision Framework for End-to-end Preprocessing, Training, Inference, and Visualization

Demystifying the Communication Characteristics for Distributed Transformer Models

Characterizing Communication in Distributed Parameter-Efficient Fine-Tuning for Large Language Models

OHIO: Improving RDMA Network Scalability in MPI_Alltoall through Optimized Hierarchical and Intra/Inter-Node Communication Overlap Design

The Case for Co-Designing Model Architectures with Hardware

Design and Implementation of an IPC-based Collective MPI Library for Intel GPUs

OMB-CXL: A Micro-Benchmark Suite for Evaluating MPI Communication Utilizing Compute Express Link Memory Devices

A Novel LLM-enabled Framework for Accelerating the Creation of Knowledge Graphs for HPC

OMB-FPGA: A Microbenchmark Suite for FPGA-aware MPIs using OpenCL and SYCL

Infer-HiRes: Accelerating Inference for High-Resolution Images with Quantization and Distributed Deep Learning

PML-MPI: A Pre-Trained ML Framework for Efficient Collective Algorithm Selection in MPI

Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference

HINT: Designing Cache-Efficient MPI_Alltoall using Hybrid Memory Copy Ordering and Non-Temporal Instructions

Accelerating MPI AllReduce Communication with Efficient GPU-Based Compression Schemes on Modern GPU Clusters

Profiling, Storing and Monitoring HPC Communication Data at Scale by OSU INAM

High-Performance Semi-Supervised Learning with HARVEST: A Distributed Computer Vision Framework for Expert Labeling

Accelerating Large Language Model Training with Hybrid GPU-based Compression

AccDP: Accelerated Data-Parallel Distributed DNN Training for Modern GPU-Based HPC Clusters

Optimized All-to-all Connection Establishment for High-Performance MPI Libraries over InfiniBand

Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference

MPI Allgather Utilizing CXL Shared Memory Pool in Multi-Node Computing Systems

HARVEST: High-Performance Artificial Vision Framework for Expert Labeling using Semi-Supervised Training

MPI4Spark Meets YARN: Enhancing MPI4Spark through YARN support for HPC

Benchmarking Modern Databases for Storing and Profiling Very Large Scale HPC Communication Data

MPI-xCCL: A Portable MPI Library over Collective Communication Libraries for Various Accelerators

Designing In-network Computing Aware Reduction Collectives in MPI

Battle of the BlueFields: An In-Depth Comparison of the BlueField-2 and BlueField-3 SmartNICs

DPU-Bench: A Micro-Benchmark Suite to Measure Offload Efficiency Of SmartNICs

Optimizing Amber for Device-to-Device GPU Communication

Enabling Reconfigurable HPC through MPI-based Inter-FPGA Communication

SAI: AI-Enabled Speech Assistant Interface for Science Gateways in HPC

A Novel Framework for Efficient Offloading of Communication Operations to Bluefield SmartNICs

Accelerating Distributed Deep Learning Training with Compression Assisted Allgather and Reduce-Scatter Communication

MCR-DL: Mix-and-Match Communication Runtime for Deep Learning

Designing and Optimizing GPU-aware Nonblocking MPI Neighborhood Collective Communication for PETSc

In-Depth Evaluation of a Lower-Level Direct-Verbs API on InfiniBand-based Clusters: Early Experiences

Implementing and Optimizing a GPU-aware MPI Library for Intel GPUs: Early Experiences

ScaMP: Scalable Meta-Parallelism for Deep Learning Search

Performance Characterization of using Quantization for DNN Inference on Edge Devices

Efficient Personalized and Non-Personalized Alltoall Communication for Modern Multi-HCA GPU-Based Clusters
