Extreme-scale computing in HPC, Big Data, Deep Learning, and Clouds is marked by multiple levels of hierarchy and heterogeneity, ranging from compute units (many-core CPUs, GPUs, APUs, etc.) to storage devices (NVMe, NVMe over Fabrics, etc.) to network interconnects (InfiniBand, High-Speed Ethernet, Omni-Path, etc.). Owing to the plethora of heterogeneous communication paths with different cost models expected to be present in extreme-scale systems, data movement lies at the heart of many of the challenges of exascale computing. At the same time, advances in networking technologies such as node-level interconnects (like NVLink) and RDMA-enabled networks are constantly pushing the envelope of research on novel communication and computing architectures for extreme-scale computing. The goal of this workshop is to bring together researchers and software/hardware designers from academia, industry, and national laboratories who are involved in creating network-based computing solutions for extreme-scale architectures. The objectives of this workshop are to share the experiences of the members of this community and to learn about the opportunities and challenges in design trends for exascale communication architectures.

ExaComm 2024 welcomes original submissions in a range of areas, including but not limited to:

  • Scalable communication protocols
  • High performance networks
  • Runtime/middleware designs
  • Impact of high performance networks on Deep Learning / Machine Learning
  • Impact of high performance networks on Big Data
  • Novel hardware/software co-design
  • High performance communication solutions for accelerator based computing
  • Power-aware techniques and designs
  • Performance evaluations
  • Quality of Service (QoS)
  • Resource virtualization and SR-IOV

Keynote Address


Speaker

Sadaf Alam, University of Bristol, UK

Abstract

Title: Maximising Sustainability of Isambard AI Exascale Supercomputing Platform, from Data Centre to Compute Nodes

Isambard AI is one of the UK's national Artificial Intelligence Research Resources (RR) and will offer exascale AI compute capabilities. The AI RR will be available to research communities aligned with its stated mission of investigating the safety and trustworthiness of AI models, large language models (LLMs), and foundational AI topics that are expected to significantly influence the sciences and our societies. Since AI compute is highly demanding, with LLM training reported to take tens to hundreds of thousands of GPU hours, it is imperative that these systems are designed with sustainability in mind as we face a climate emergency. This talk gives an overview of the sustainability and performance features of Isambard AI, which have been our guiding principles from the design of the data centre down to the individual compute nodes. The Isambard AI exascale platform is deployed in a modular, containerised data centre. Direct liquid-cooled HPE Cray EX cabinets offer maximum power efficiency and a small physical footprint. NVIDIA Grace Hopper GH200 superchips are optimised for energy-efficient data movement in addition to AI compute horsepower. Throughout, we carefully consider the University of Bristol's Net Zero by 2030 target and report on scope 1, 2, and 3 emissions. The talk will include updates on Isambard AI phase 1, which went from zero (no data centre) to running AI workloads in under six months.

Invited Talks


Panel

Title

Do we need special-purpose networking technologies for handling AI workloads?

Moderator

Nectarios Koziris, National Technical University of Athens, Greece

Members

Organizing Committee


Program Chairs

Web and Publicity Chair