Extreme-scale computing in HPC, Big Data, Deep Learning, and Clouds is marked by multiple levels of hierarchy and heterogeneity, ranging from the compute units (many-core CPUs, GPUs, APUs, etc.) to storage devices (NVMe, NVMe over Fabrics, etc.) to the network interconnects (InfiniBand, High-Speed Ethernet, Omni-Path, etc.). Owing to the plethora of heterogeneous communication paths with different cost models expected to be present in extreme-scale systems, data movement is seen as being at the heart of many challenges for exascale computing. On the other hand, advances in networking technologies such as NoCs (like NVLink), RDMA-enabled networks, and the like are constantly pushing the envelope of research in the field of novel communication and computing architectures for extreme-scale computing. The goal of this workshop is to bring together researchers and software/hardware designers from academia, industry, and national laboratories who are involved in creating network-based computing solutions for extreme-scale architectures. The objectives of this workshop are to share the experiences of the members of this community and to learn about the opportunities and challenges in the design trends for exascale communication architectures.

All times in Central European Summer Time (CEST)

Workshop Program

2:00PM-2:05PM

Opening Remarks

Hari Subramoni, Aamir Shafi, and Dhabaleswar K (DK) Panda, The Ohio State University

2:05PM-2:45PM

Keynote

Speaker: Sadaf Alam, University of Bristol, UK

Session Chair: Dhabaleswar K (DK) Panda, The Ohio State University

Title: Maximising Sustainability of Isambard AI Exascale Supercomputing Platform, from Data Centre to Compute Nodes

Abstract: Isambard AI is one of the UK’s national Artificial Intelligence Research Resources (AI RR) and will offer exascale AI compute capabilities. The AI RR will be available to research communities aligned with the stated mission of investigating the safety and trustworthiness of AI models, large language models (LLMs), and foundational AI topics that are expected to significantly influence the sciences and our societies. Since AI compute is highly demanding, with LLM training reported to take several tens to hundreds of thousands of GPU hours, it is imperative that these systems are designed with sustainability in mind as we face a climate emergency. This talk gives an overview of the sustainability and performance features of Isambard AI, which have been our guiding principles from designing the data centre to individual computing node solutions. The Isambard AI exascale platform is deployed in a modular, containerised data centre. Direct liquid-cooled HPE Cray EX cabinets offer maximum power efficiency and a small physical footprint. Nvidia Grace-Hopper GH200 superchips are optimised for energy-efficient data movement in addition to AI compute horsepower. Overall, we carefully consider the University of Bristol's Net Zero by 2030 target and report on scope 1, 2, and 3 emissions. The talk will include updates on Isambard AI phase 1, which went from zero (no data centre) to running AI workloads in less than 6 months.

Speaker Bio: Sadaf Alam is Chief Technology Officer (CTO) for Isambard supercomputing Digital Research Infrastructures (DRIs) and director of strategy and academia in the Advanced Computing Research Centre at the University of Bristol, UK. She is responsible for digital transformation for research computing and data asset management services. Prior to joining Bristol, Dr Alam was the CTO at CSCS, the Swiss National Supercomputing Centre. She was chief architect for two generations of the innovative Piz Daint flagship supercomputing facilities and for the MeteoSwiss operational weather forecasting platforms. She was technical lead for Fenix, a federation project of European supercomputing centres. From 2004 to 2009, Dr Alam was a computer scientist at Oak Ridge National Laboratory (ORNL), USA, and a staff scientist at the ORNL Leadership Computing Facility (OLCF). She studied computer science at the University of Edinburgh, UK, where she received her PhD. She was a founding member of the Swiss Chapter of Women in HPC.

2:45PM-3:10PM

Speaker: Gilad Shainer, NVIDIA

Session Chair: TBD

Title: Entering A New Frontier of AI Networking Innovation

Abstract: NVIDIA networking technologies are designed for training AI at scale. In-network computing, highly effective bandwidth, and noise isolation capabilities have facilitated the creation of larger and more complex foundational models. We'll dive deep into the recent technology announcements and their essential roles in next-generation AI data center designs.

Speaker Bio: Gilad Shainer serves as senior vice president of networking at NVIDIA, focusing on high-performance computing and artificial intelligence. He holds multiple patents in the field of high-speed networking. Gilad Shainer holds an M.S. and a B.S. in electrical engineering from the Technion Institute of Technology in Israel.

3:10PM-3:35PM

Speaker: Brian Smith, Cornelis Networks

Session Chair: Dhabaleswar K (DK) Panda, The Ohio State University

Title:

Abstract:

Speaker Bio:

3:35PM-4:00PM

Speaker: Murali Krishna Emani, Argonne National Laboratory

Session Chair: Hari Subramoni, The Ohio State University

Title: Toward a Holistic Performance Evaluation of Large Language Models Across Diverse AI Accelerators

Abstract: Artificial intelligence (AI) methods have become critical in scientific applications to help accelerate scientific discovery. Large language models (LLMs) are being considered a promising approach to address some challenging problems because of their superior generalization capabilities across domains. The effectiveness of the models and the accuracy of the applications are contingent upon their efficient execution on the underlying hardware infrastructure. Specialized AI accelerator hardware systems have recently become available for accelerating AI applications. However, the comparative performance of these AI accelerators on large language models has not been previously studied. In this work, we systematically study LLMs on multiple AI accelerators along with GPUs and evaluate their performance characteristics for these models. We evaluate these systems with (i) a micro-benchmark using a core transformer block, (ii) a GPT model, and (iii) an LLM-driven science use case, GenSLM. I will present our findings and analyses of the models' performance to better understand the intrinsic capabilities of AI accelerators in this benchmarking effort.

Speaker Bio: Murali Emani is a Computer Scientist in the AI/ML group at the Argonne Leadership Computing Facility (ALCF). Murali obtained a PhD from the University of Edinburgh, UK. His research interests are in scalable machine learning, emerging HPC/AI architectures, and performance optimization and benchmarking. At ALCF, he co-leads the AI Testbed to help evaluate the performance and efficiency of AI accelerators for scientific machine learning applications. He also co-chairs the MLPerf HPC group at MLCommons, which benchmarks large-scale ML on HPC systems.

4:00PM-4:30PM

Break

4:30PM-4:55PM

Speaker:

Session Chair: Hari Subramoni, The Ohio State University

Title:

Abstract:

Speaker Bio:

4:55PM-5:20PM

Speaker: Shrijeet Mukherjee, Enfabrica

Session Chair: Hari Subramoni, The Ohio State University

Title: Foundational Networking Silicon for the Accelerated Computing Era

Abstract: The past two decades of data center networking have been dominated by the expansion of scale-out design principles. The goal was to create large domains of compute built out of simple, small, homogeneous units of CPUs and associated I/O with a contained blast radius. The inefficiency of distributing all computing into these uniform shards was compensated for by overprovisioning the number of shards and prioritizing stability over efficiency. Other than making each compute “stovepipe” faster, the remaining problem was to perform efficient shard distribution, which is where most of the last decade-plus has been focused (e.g. SmartNICs).

Fast forward to 2024, where data centers have been overrun by an explosion in artificial intelligence and accelerated computing workloads, forcing a rethink of the networking designs that dominated the previous decade. Enfabrica is designing a new networking architecture, manifested in the Accelerated Compute Fabric (ACF) superNIC chip - purpose-built for addressing the needs of high-performance, heterogeneous, distributed computing, where bandwidth and latency are precious resources, and stranding is neither technically nor economically viable.

The talk will cover the basic components of the ACF superNIC design and the system-level innovations and benefits of this architecture over conventional networking designs.

Speaker Bio: Shrijeet is Co-Founder and Chief Technology Officer of Enfabrica, where he is responsible for overseeing a new class of technology aimed at addressing the needs of a modern heterogeneous and composable computer architecture. Prior to founding Enfabrica, he dedicated over three decades to building large distributed computing systems. He started out by building large NUMA graphics systems during his tenure at SGI, where he played a pivotal role in the development of the first floating point GPU before it was defined as such. Shrijeet led the NIC and virtualization groups while at Cisco Unified Computing System, and spearheaded the development of the SmartNIC, later recognized as DPUs. Later, he served as the VP of Engineering of Cumulus Networks, shepherding the Open Networking revolution, and built high-performance routers and switches.

Shrijeet is on the Linux NetDev Society Board of Directors and has over 40 patents. He holds an MS in Computer Science from the University of Oregon.

5:20PM-6:00PM

Panel Moderator: Nectarios Koziris, National Technical University of Athens, Greece

Title: Do we need special-purpose networking technologies for handling AI workloads?

Summary:

Moderator Bio:

Panelists:

6:00PM-6:05PM

Closing Remarks

Hari Subramoni, Aamir Shafi, and Dhabaleswar K (DK) Panda, The Ohio State University