A Tutorial on


High-Performance and Smart Networking Technologies for HPC and AI


In conjunction with SCAsia 2025


Monday March 10th 2025


Singapore


by


Dhabaleswar K. Panda, Hari Subramoni, Benjamin Michalowicz


The Ohio State University

Abstract

As InfiniBand (IB), High-Speed Ethernet (HSE), RoCE, and Omni-Path technologies mature, they are being used to design and deploy various High-End Computing (HEC) systems: HPC clusters with GPGPUs supporting MPI, Storage and Parallel File Systems, Cloud Computing systems with SR-IOV Virtualization, Grid Computing systems, and Deep Learning systems. These systems bring new challenges in terms of performance, scalability, portability, reliability, and network congestion. Many scientists, engineers, researchers, managers, and system administrators are becoming interested in learning about these challenges, the approaches being used to address them, and the associated impact on performance and scalability. This tutorial will start with an overview of these systems. It will then emphasize the advanced hardware and software features of IB, Omni-Path, HSE, and RoCE and their capabilities to address these challenges. Next, we will focus on OpenFabrics RDMA and Libfabric programming, and on the network management infrastructure and tools needed to use these systems effectively. A common set of challenges faced while designing these systems will be presented, followed by case studies covering domain-specific design challenges, their solutions, and sample performance numbers. Finally, hands-on exercises will be carried out with the OpenFabrics and Libfabric software stacks and network management tools.

Outline

Presenters

Dhabaleswar K. Panda

DK Panda is a Professor and University Distinguished Scholar of Computer Science and Engineering at the Ohio State University. He serves as the Director of the ICICLE NSF-AI Institute (https://icicle.ai). He has published over 500 papers. The MVAPICH MPI libraries, designed and developed by his research group (http://mvapich.cse.ohio-state.edu), are currently being used by more than 3,400 organizations worldwide (in 92 countries). More than 1.8 million downloads of this software have taken place from the project's site. This software is empowering many clusters in the TOP500 list. High-performance and scalable solutions for DL/ML frameworks from his group are available from https://hidl.cse.ohio-state.edu. Similarly, scalable and high-performance solutions for Big Data and Data Science frameworks are available from https://hibd.cse.ohio-state.edu. Prof. Panda is an IEEE Fellow and recipient of the 2022 IEEE Charles Babbage and 2024 TCPP Outstanding Service and Contribution Awards. More details about Prof. Panda are available at http://www.cse.ohio-state.edu/~panda.

Hari Subramoni

Hari is an Assistant Professor in the Department of Computer Science and Engineering at the Ohio State University, USA. His current research interests include high-performance interconnects and protocols, parallel computer architecture, network-based computing, exascale computing, network topology-aware computing, QoS, power-aware LAN-WAN communication, fault tolerance, virtualization, big data, deep learning, machine learning, conversational interfaces, digital agriculture, and cloud computing. He has co-authored over 170 publications in international journals and conferences related to these research areas. He is a member of ACM and IEEE.

Benjamin Michalowicz

Ben Michalowicz is a fourth-year PhD student at the Ohio State University under Prof. DK Panda and Prof. Hari Subramoni in the Network-Based Computing Laboratory. His research interests include high-performance computing (HPC), parallel architectures, network-based computing for HPC, and parallel programming environments. Specifically, he is interested in efficiently offloading workloads to Smart Network Cards such as NVIDIA's BlueField DPUs. Ben actively contributes to the MVAPICH software.