A Tutorial on
Principles and Practice of Scalable and Distributed Deep Neural Networks Training and Inference
In conjunction with the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '25), Las Vegas, NV
March 1, 2025 (8:00 AM - 12:00 PM Pacific Time)
by
Dhabaleswar K. Panda, Hari Subramoni, Nawras Alnaasan
The Ohio State University
Abstract
Recent advances in Deep Learning (DL) have led to many exciting challenges and opportunities. Modern DL frameworks, including TensorFlow, PyTorch, Horovod, and DeepSpeed, enable high-performance training, inference, and deployment for various types of Deep Neural Networks (DNNs) such as GPT, BERT, ViT, and ResNet. This tutorial provides an overview of recent trends in DL and the role of cutting-edge hardware architectures and interconnects in moving the field forward. We will also present an overview of different DNN architectures, DL frameworks, and DL training and inference, with a special focus on parallelization strategies for model training. We highlight new challenges and opportunities for communication runtimes to exploit high-performance CPU/GPU architectures and efficiently support large-scale distributed training. We also highlight some of our co-design efforts to utilize MPI for large-scale DNN training on cutting-edge CPU and GPU architectures available on modern HPC clusters. Throughout the tutorial, we include several hands-on exercises so that attendees gain first-hand experience running distributed DL training and inference on a modern GPU cluster.
Outline
- Introduction
  - The Past, Present, and Future of Artificial Intelligence (AI)
  - Brief History and Current/Future Trends of Machine Learning (ML) and Deep Learning (DL)
  - What are Deep Neural Networks?
  - Deep Learning Frameworks
  - Deep Neural Network Training
- Distributed Data-Parallel Training
  - Basic Principles and Parallelization Strategies
  - Hands-on Exercises (Data Parallelism) using PyTorch and TensorFlow (see the data-parallel sketch after this outline)
- Latest Trends in High-Performance Computing Architectures
  - HPC Hardware
  - Communication Middleware
- Advanced Distributed Training
  - State-of-the-art Approaches using CPUs and GPUs
  - Hands-on Exercises (Advanced Parallelism) using DeepSpeed (see the DeepSpeed sketch after this outline)
- Distributed Inference Solutions
  - Overview of DL Inference
  - Case Studies
- Open Issues and Challenges
- Conclusions and Final Q&A
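Hands-on Preview
As a preview of the data-parallelism hands-on exercises listed above, the following is a minimal sketch of distributed data-parallel training with PyTorch's DistributedDataParallel. The toy model, synthetic dataset, hyperparameters, and the torchrun launch line are illustrative assumptions, not the exact exercise code used in the tutorial.

# Minimal data-parallel training sketch with PyTorch DistributedDataParallel.
# Launch with, e.g.: torchrun --nproc_per_node=<num_gpus> ddp_sketch.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model and synthetic data stand in for the tutorial's real workloads.
    model = torch.nn.Linear(1024, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    dataset = TensorDataset(torch.randn(4096, 1024),
                            torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(dataset)          # shards the data across ranks
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                   # reshuffle consistently per epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()    # gradients are all-reduced across ranks here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Each rank processes its own shard of the data, and gradients are synchronized with an all-reduce during the backward pass; the same pattern underlies the communication-runtime discussion in the tutorial.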
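The advanced-parallelism exercises build on DeepSpeed. Below is a minimal sketch of wrapping a model with DeepSpeed and ZeRO stage 2; the toy model, configuration values, and training loop are illustrative assumptions rather than the tutorial's actual exercise code.

# Minimal DeepSpeed sketch with a ZeRO stage-2 configuration.
# Launch with, e.g.: deepspeed ds_sketch.py
import torch
import deepspeed

# Illustrative configuration; batch size, optimizer, and ZeRO stage are placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "zero_optimization": {"stage": 2},
}

# Toy model stands in for the tutorial's real workloads (e.g., GPT, ViT, ResNet).
model = torch.nn.Linear(1024, 10)

# deepspeed.initialize returns an engine that manages the distributed optimizer,
# gradient partitioning, and communication on behalf of the user.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

loss_fn = torch.nn.CrossEntropyLoss()
for step in range(10):
    x = torch.randn(8, 1024, device=engine.device)
    y = torch.randint(0, 10, (8,), device=engine.device)
    loss = loss_fn(engine(x), y)
    engine.backward(loss)  # handles loss scaling and partitioned gradient reduction
    engine.step()          # optimizer step plus DeepSpeed's internal bookkeeping

ZeRO partitions optimizer state across ranks at stage 1, adds gradient partitioning at stage 2, and additionally partitions parameters at stage 3, reducing per-GPU memory as models grow.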
Presenters
Dhabaleswar K. Panda

DK Panda is a Professor and University Distinguished Scholar of Computer Science and Engineering at The Ohio State University. He serves as the Director of the ICICLE NSF-AI Institute (https://icicle.ai) and has published over 500 papers. The MVAPICH MPI libraries, designed and developed by his research group (http://mvapich.cse.ohio-state.edu), are currently used by more than 3,400 organizations worldwide (in 92 countries), and more than 1.8 million downloads of this software have taken place from the project's site. This software is empowering many clusters in the TOP500 list. High-performance and scalable solutions for DL/ML frameworks from his group are available from https://hidl.cse.ohio-state.edu. Similarly, scalable and high-performance solutions for Big Data and Data science frameworks are available from https://hibd.cse.ohio-state.edu. Prof. Panda is an IEEE Fellow and a recipient of the 2022 IEEE Charles Babbage Award and the 2024 TCPP Outstanding Service and Contribution Award. More details about Prof. Panda are available at http://www.cse.ohio-state.edu/~panda.
Hari Subramoni

Dr. Hari Subramoni is an assistant professor in the Department of Computer Science and Engineering at The Ohio State University, USA. His current research interests include high-performance interconnects and protocols, parallel computer architecture, network-based computing, exascale computing, network-topology-aware computing, QoS, power-aware LAN-WAN communication, fault tolerance, virtualization, big data, digital agriculture, distributed deep/machine learning, and cloud computing. He has published over 150 papers in international journals and conferences in these research areas. He has been actively involved in various professional activities for academic journals and conferences. He is a member of IEEE and ACM. More details about Dr. Subramoni are available from https://cse.osu.edu/people/subramoni.1.
Nawras Alnaasan
Nawras Alnaasan is a Graduate Research Associate at the Network-Based Computing Laboratory, Columbus, OH, USA, currently pursuing a Ph.D. in computer science and engineering at The Ohio State University. His research interests lie at the intersection of deep learning and high-performance computing. He works on advanced parallelization techniques to accelerate Deep Neural Network training and optimize HPC resource utilization, covering a range of DL applications, including supervised learning, semi-supervised learning, and hyperparameter optimization. He is actively involved in several projects, including HiDL (High-performance Deep Learning) and ICICLE (Intelligent Cyberinfrastructure with Computational Learning in the Environment). Alnaasan received his B.S. in computer science and engineering from The Ohio State University.