A Tutorial on
High-Performance and Smart Networking Technologies for HPC and AI
In conjunction with SCA/HPCAsia 2026
Monday, January 26, 2026, 9:30 - 12:30 Japan Time
Osaka, Japan
by
Dhabaleswar K. Panda, Benjamin Michalowicz
The Ohio State University
Abstract
As InfiniBand (IB), High-speed Ethernet (HSE), RoCE, and Omni-Path technologies mature, they are being used to design and deploy various High-End Computing (HEC) systems: HPC clusters with GPGPUs supporting MPI, storage and parallel file systems, cloud computing systems with SR-IOV virtualization, grid computing systems, and deep learning systems. These systems bring new challenges in terms of performance, scalability, portability, reliability, and network congestion. Many scientists, engineers, researchers, managers, and system administrators are interested in learning about these challenges, the approaches being used to solve them, and the associated impact on performance and scalability. This tutorial will start with an overview of these systems. Advanced hardware and software features of IB, Omni-Path, HSE, and RoCE, and their capabilities to address these challenges, will be emphasized. Next, we will focus on OpenFabrics RDMA and Libfabric programming, along with the network management infrastructure and tools needed to use these systems effectively. A common set of challenges faced while designing these systems will be presented. Case studies focusing on domain-specific challenges in designing these systems, their solutions, and sample performance numbers will follow. Finally, hands-on exercises will be carried out with the OpenFabrics and Libfabric software stacks and network management tools.
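To give a flavor of the Libfabric/OFI programming covered in the hands-on portion, here is a minimal sketch (illustrative only, not taken from the tutorial materials) that queries the fabric providers available on a node via fi_getinfo(). It assumes libfabric (API version 1.18 or later) is installed and links with -lfabric.

    /* Minimal sketch: list the Libfabric (OFI) providers visible on this
     * node. Illustrative only; assumes libfabric >= 1.18 is installed.
     * Build: gcc list_providers.c -o list_providers -lfabric */
    #include <stdio.h>
    #include <rdma/fabric.h>

    int main(void)
    {
        struct fi_info *hints = fi_allocinfo();
        struct fi_info *info = NULL, *cur;
        int ret;

        if (!hints)
            return 1;
        hints->caps = FI_MSG | FI_RMA;     /* two-sided send/recv + RDMA */
        hints->ep_attr->type = FI_EP_RDM;  /* reliable, unconnected endpoints */

        ret = fi_getinfo(FI_VERSION(1, 18), NULL, NULL, 0, hints, &info);
        if (ret) {
            fprintf(stderr, "fi_getinfo: %s\n", fi_strerror(-ret));
            fi_freeinfo(hints);
            return 1;
        }
        for (cur = info; cur; cur = cur->next)
            printf("provider: %-12s fabric: %s\n",
                   cur->fabric_attr->prov_name, cur->fabric_attr->name);

        fi_freeinfo(info);
        fi_freeinfo(hints);
        return 0;
    }

The same fi_getinfo() call is the starting point for opening fabrics, domains, and endpoints in a full Libfabric program.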
Outline
- Trends in High-End Computing
- Why High-Performance Networking for HPC and AI?
  - TCP vs User-level communication protocols
  - Requirements (communication, I/O, performance, cost, RAS) from the perspective of designing next-generation high-end systems and scalable data centers
- Communication Model and Semantics of High-Performance Networks
- Architectural Overview of High-Performance Networks
  - Architecture Overview
  - Convergence
  - Ultra-Ethernet Consortium (UEC)
- Overview of Modern Scale-Out RDMA Networks
  - Omni-Path Interconnect Architecture
  - Amazon EFA Interconnect Architecture
  - Cray/HPE Slingshot Interconnect Architecture
- Overview of Scale-Up Networks
  - NVIDIA NVLink/NVSwitch Interconnect Architecture
  - AMD Infinity Fabric/xGMI Interconnect Architecture
  - Ultra-Accelerator Link Consortium (UALink)
- Overview of Software Stacks for Commodity High-Performance Networks
  - OpenFabrics
  - UCX and Libfabric/Open Fabrics Interface (OFI)
- GPU-Aware Communication
  - GPU-Direct RDMA (GDR)
- Overview of Emerging Smart Network Interfaces
  - Collectives with NVIDIA SHARP
  - Architectural features and principles of offloading (NVIDIA BlueField DPUs, AMD Pensando, Intel Columbiaville)
- High-Performance Network Deployments for AI Workloads
  - Overview and architectural features of Cerebras WSE
  - Overview and architectural features of Habana Gaudi
- Network trends in the TOP500
- Sample Case Studies and Performance Numbers
- Hands-on Exercises (a minimal MPI ping-pong sketch follows this outline)
  - Evaluating and understanding the performance of high-performance networks at the fabric level
  - Evaluating and understanding the performance of high-performance networks at the MPI level
- Conclusions, Final Q&A, and Discussion
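As a preview of the MPI-level hands-on exercise, the sketch below measures small-message ping-pong latency between two ranks. It only illustrates the measurement pattern; the actual exercises will likely rely on established suites such as the OSU Micro-Benchmarks from the presenters' group, and the message size and iteration count here are arbitrary.

    /* Minimal sketch: 8-byte ping-pong latency between two MPI ranks.
     * Illustrative only; message size and iteration count are arbitrary.
     * Build/run: mpicc pingpong.c -o pingpong && mpirun -np 2 ./pingpong */
    #include <mpi.h>
    #include <stdio.h>

    #define MSG_SIZE 8
    #define ITERS    1000

    int main(int argc, char **argv)
    {
        char buf[MSG_SIZE] = {0};
        int rank, size, i;
        double start, end;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size < 2) {
            if (rank == 0)
                fprintf(stderr, "run with at least 2 ranks\n");
            MPI_Finalize();
            return 1;
        }

        MPI_Barrier(MPI_COMM_WORLD);
        start = MPI_Wtime();
        for (i = 0; i < ITERS; i++) {
            if (rank == 0) {
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        end = MPI_Wtime();

        /* Each iteration is a round trip, so divide by 2 * ITERS. */
        if (rank == 0)
            printf("avg one-way latency: %.2f us\n",
                   (end - start) * 1e6 / (2.0 * ITERS));

        MPI_Finalize();
        return 0;
    }

The fabric-level exercise follows the same pattern one layer down, timing transfers directly over the OpenFabrics/Libfabric interfaces rather than through MPI.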
Presenters
Dhabaleswar K. Panda
DK Panda is a Professor and University Distinguished Scholar of Computer Science and Engineering at The Ohio State University. He serves as the Director of the ICICLE NSF-AI Institute (https://icicle.ai) and has published over 500 papers. The MVAPICH MPI libraries, designed and developed by his research group (http://mvapich.cse.ohio-state.edu), are currently being used by more than 3,450 organizations worldwide (in 93 countries), and more than 1.95 million downloads of this software have taken place from the project's site. This software is empowering many clusters in the TOP500 list. High-performance and scalable solutions for DL/ML frameworks from his group are available from https://hidl.cse.ohio-state.edu. Similarly, scalable and high-performance solutions for Big Data and Data Science frameworks are available from https://hibd.cse.ohio-state.edu. Prof. Panda is an IEEE Fellow, an ACM Fellow, and the recipient of the 2022 IEEE Charles Babbage Award and the 2024 TCPP Outstanding Service and Contribution Award. More details about Prof. Panda are available at http://www.cse.ohio-state.edu/~panda.
Benjamin Michalowicz
Ben Michalowicz is a fifth-year PhD student at The Ohio State University, advised by Prof. DK Panda in the Network-Based Computing Laboratory. His research interests include high-performance computing (HPC), parallel architectures, network-based computing for HPC, and parallel programming environments. In particular, he is interested in efficiently offloading workloads to smart network cards such as NVIDIA's BlueField DPUs. Ben is an active contributor to the MVAPICH software libraries.