High-Performance Deep Learning and Machine Learning

Overview

The availability of large data sets (e.g. ImageNet, PASCAL VOC 2012) coupled with massively parallel processors in modern HPC systems (e.g. NVIDIA GPUs) have fueled a renewed interest in Deep Learning (DL) and Machine Learning (ML) models. In addition to the popularity of massively parallel DL/ML accelerators like GPUs, the availability and memory-abundance of modern CPUs poses a viable alternative for DL/ML training. This resurgence of DL/ML applications has triggered the development of DL frameworks like PyTorch, TensorFlow, LBANN, and Apache MXNet as well as ML frameworks like Scikit-Learn and cuML. While most DL/ML frameworks provide experimental support for multi-node training, their distributed implementation is often suboptimal. Further, the emergence of distributed DL frameworks such as Horovod and DeepSpeed introduce novel parallelism challenges.

Objectives

The objective of the HiDL/HiML projects are to design and implement novel parallelization strategies to train next-generation out-of-core models, and to exploit modern HPC technologies and solutions to fundamentally improve the performance of distributed DL/ML training and inference.

Software Distribution

Link to High-Performance Deep Learning and Machine Learning

Results

Link to HiDL Results
Link to HiML Results

Conferences & Workshops (15)
1	Demystifying the Communication Characteristics for Distributed Transformer Models Q. Anthony, B. Michalowicz, J. Hatef, L. Xu, M. Abduljabbar, A. Shafi, H. Subramoni, and DK Panda, IEEE Hot Interconnects Symposium 2024, Aug 2024 [Q. Anthony and B. Michalowicz are co-lead authors] [Bib - Plain]
2	Characterizing Communication in Distributed Parameter-Efficient Fine-Tuning for Large Language Models N. Alnaasan, H. Huang, A. Shafi, H. Subramoni, and DK Panda, IEEE Hot Interconnects Symposium 2024, Aug 2024 [Bib - Plain]
3	The Case for Co-Designing Model Architectures with Hardware Q. Anthony, J. Hatef, Deepak Narayanan, Stella Biderman, Stas Bekman, Junqi Yin, A. Shafi, H. Subramoni, and DK Panda, 53rd International Conference on Parallel Processing, Aug 2024 [Bib - Plain]
4	A Novel LLM-enabled Framework for Accelerating the Creation of Knowledge Graphs for HPC P. Kousha, V. Sathu, H. M. Han, J. Jani, N. Alnaasan, H. Subramoni, and DK Panda, Practice and Experience in Advanced Research Computing, Jul 2024 [Jul 2024] [Bib - Plain]
5	Infer-HiRes: Accelerating Inference for High-Resolution Images with Quantization and Distributed Deep Learning R. Gulhane, Q. Anthony, H. Subramoni, and DK Panda, Practice and Experience in Advanced Research Computing, Jul 2024 [Jul 2024] [Bib - Plain]
6	Accelerating MPI AllReduce Communication with Efficient GPU-Based Compression Schemes on Modern GPU Clusters Q. Zhou, B. Ramesh, A. Shafi, M. Abduljabbar, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2024, May 2024 [Bib - Plain]
7	Accelerating Distributed Deep Learning Training with Compression Assisted Allgather and Reduce-Scatter Communication Q. Zhou, Q. Anthony, L. Xu, A. Shafi, M. Abduljabbar, H. Subramoni, and DK Panda, 37th IEEE International Parallel & Distributed Processing Symposium (IPDPS '23), May 2023 [Bib - Plain]
8	MCR-DL: Mix-and-Match Communication Runtime for Deep Learning Q. Anthony, Ammar Awan, J. Rasley, Y. He, A. Shafi, M. Abduljabbar, H. Subramoni, and DK Panda, 37th IEEE International Parallel & Distributed Processing Symposium (IPDPS '23), May 2023 [Bib - Plain]
9	ScaMP: Scalable Meta-Parallelism for Deep Learning Search Q. Anthony, L. Xu, A. Shafi, H. Subramoni, and DK Panda, THE 23RD IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, May 2023 [Bib - Plain]
10	Performance Characterization of using Quantization for DNN Inference on Edge Devices H. Ahn, T. Chen, N. Alnaasan, A. Shafi, M. Abduljabbar, H. Subramoni, and DK Panda, 7TH IEEE INTERNATIONAL CONFERENCE ON FOG AND EDGE COMPUTING, May 2023 [Bib - Plain]
11	AccDP: Accelerated Data-Parallel Distributed DNN Training for Modern GPU-Based HPC Clusters N. Alnaasan, A. Jain, A. Shafi, H. Subramoni, and DK Panda, 29th IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2022 [Bib - Plain]
12	Accelerating Broadcast Communication with GPU Compression for Deep Learning Workloads Q. Zhou, Q. Anthony, A. Shafi, H. Subramoni, and DK Panda, 29th IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2022 [Bib - Plain]
13	OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries on HPC Systems N. Alnaasan, A. Jain, A. Shafi, H. Subramoni, and DK Panda, 23rd Parallel and Distributed Scientific and Engineering Computing Workshop (PDSEC) at IPDPS22, May 2022 [Bib - Plain]
14	Hy-Fi: Hybrid Five-Dimensional Parallel DNN Training on High-Performance GPU Clusters A. Jain, A. Shafi, Q. Anthony, P. Kousha, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2022, May 2022 [Bib - Plain]
15	Scaling Single-Image Super-Resolution Training on Modern HPC Clusters: Early Experiences Q. Anthony, L. Xu, H. Subramoni, and DK Panda, Scalable Deep Learning over Parallel And Distributed Infrastructures, May 2021 [Bib - Plain]

Ph.D. Disserations (1)
1	M. Bayatpour, Designing High Performance Hardware-assisted Communication Middlewares for Next-Generation HPC Systems, May 2021

M.S. Thesis (2)
1	S. Srivastava, MVAPICH2-AutoTune: An Automatic Collective Tuning Framework for the MVAPICH2 MPI Library, May 2021
2	N. Senthil Kumar, Designing Optimized MPI+NCCL Hybrid Collective Communication Routines for Dense Many-GPU Clusters, May 2021

NOWLAB: Network Based Computing Lab