ExaComm :: ExaComm 2024

Extreme-scale computing in HPC, Big Data, Deep Learning and Clouds is marked by multiple levels of hierarchy and heterogeneity, ranging from the compute units (many-core CPUs, GPUs, APUs, etc.) to storage devices (NVMe, NVMe over Fabrics, etc.) to the network interconnects (InfiniBand, High-Speed Ethernet, Omni-Path, etc.). Owing to the plethora of heterogeneous communication paths with different cost models expected to be present in extreme-scale systems, data movement is seen as being at the heart of many of the challenges for exascale computing. On the other hand, advances in networking technologies such as NoCs (like NVLink), RDMA-enabled networks and the like are constantly pushing the envelope of research in the field of novel communication and computing architectures for extreme-scale computing. The goal of this workshop is to bring together researchers and software/hardware designers from academia, industry and national laboratories who are involved in creating network-based computing solutions for extreme-scale architectures. The objectives of this workshop are to share the experiences of the members of this community and to learn about the opportunities and challenges in the design trends for exascale communication architectures.

ExaComm 2024 welcomes original submissions in a range of areas, including but not limited to:

Scalable communication protocols
High performance networks
Runtime/middleware designs
Impact of high performance networks on Deep Learning / Machine Learning
Impact of high performance networks on Big Data
Novel hardware/software co-design
High performance communication solutions for accelerator-based computing
Power-aware techniques and designs
Performance evaluations
Quality of Service (QoS)
Resource virtualization and SR-IOV

Keynote Address

Speaker

Sadaf Alam, University of Bristol, UK

Abstract

Title: Maximising Sustainability of Isambard AI Exascale Supercomputing Platform, from Data Centre to Compute Nodes

Isambard AI is one of the UK’s national Artificial Intelligence Research Resources (AI RR) and will offer exascale AI compute capabilities. The AI RR will be available to research communities aligned with the stated mission of investigating the safety and trustworthiness of AI models, large language models (LLMs) and foundational AI topics that are expected to significantly influence the sciences and our societies. Since AI compute is highly demanding, with LLM training reported to take several tens to hundreds of thousands of GPU hours, it is imperative that these systems are designed with sustainability in mind as we face a climate emergency. This talk gives an overview of the sustainability and performance features of Isambard AI, which have been our guiding principles from the design of the data centre down to the individual compute node solutions. The Isambard AI exascale platform is deployed in a modular, containerised data centre. Direct liquid-cooled HPE Cray EX cabinets offer maximum power efficiency and a small physical footprint. NVIDIA Grace-Hopper GH200 superchips are optimised for energy-efficient data movement in addition to AI compute horsepower. Overall, we carefully consider the University of Bristol’s Net Zero by 2030 target and report on scope 1, 2 and 3 emissions. The talk will include updates on Isambard AI phase 1, which went from zero (no data centre) to running AI workloads in less than 6 months.

Invited Talks

Murali Krishna Emani, Argonne National Laboratory
Douglas Fuller, Cornelis Networks
Shrijeet Mukherjee, Enfabrica
Gilad Shainer, NVIDIA
Debendra Das Sharma, Intel

Panel

Title

Do we need special-purpose networking technologies for handling AI workloads?

Moderator

Nectarios Koziris, National Technical University of Athens, Greece

Members

Murali Krishna Emani, Argonne National Laboratory
Steve Scott, AMD
John Shalf, Lawrence Berkeley National Laboratory (LBNL)

Organizing Committee

Program Chairs

Hari Subramoni, The Ohio State University
Dhabaleswar K. (DK) Panda, The Ohio State University
Aamir Shafi, The Ohio State University

Web and Publicity Chair

Mustafa Abduljabbar, The Ohio State University

ExaComm 2024

Ninth International Workshop on Communication Architectures for HPC, Big Data, Deep Learning and Clouds at Extreme Scale

In conjunction with International Supercomputing Conference (ISC 2024)

Thursday, May 16, 2024

Hamburg, Germany
