Extreme-scale computing in HPC, Big Data, Deep Learning, and Clouds is marked by multiple levels of hierarchy and heterogeneity, ranging from compute units (many-core CPUs, GPUs, APUs, etc.) to storage devices (NVMe, NVMe over Fabrics, etc.) to network interconnects (InfiniBand, High-Speed Ethernet, Omni-Path, etc.). Owing to the plethora of heterogeneous communication paths with different cost models expected to be present in extreme-scale systems, data movement lies at the heart of many of the challenges of exascale computing. At the same time, advances in networking technologies such as chip- and node-level interconnects (like NVLink), RDMA-enabled networks, and the like are constantly pushing the envelope of research on novel communication and computing architectures for extreme-scale computing. The goal of this workshop is to bring together researchers and software/hardware designers from academia, industry, and national laboratories who are involved in creating network-based computing solutions for extreme-scale architectures. The objectives of this workshop are to share the experiences of the members of this community and to learn about the opportunities and challenges in the design trends for exascale communication architectures.

All times in Central European Summer Time (CEST)

Workshop Program

2:00PM-2:05PM

Opening Remarks

Hari Subramoni, Aamir Shafi, and Dhabaleswar K (DK) Panda, The Ohio State University

2:05PM-2:45PM

Keynote

Speaker: Sadaf Alam, University of Bristol, UK

Session Chair: Dhabaleswar K (DK) Panda, The Ohio State University

Title: Maximising Sustainability of Isambard AI Exascale Supercomputing Platform, from Data Centre to Compute Nodes

Abstract: Isambard AI is one of the UK's national Artificial Intelligence Research Resources (RR) and will offer exascale AI compute capabilities. The AI RR will be available to research communities aligned with its stated mission of investigating the safety and trustworthiness of AI models, large language models (LLMs), and foundational AI topics that are expected to significantly influence the sciences and our societies. Since AI compute is highly demanding, with training of LLMs reported to take tens to hundreds of thousands of GPU hours, it is imperative that these systems are designed with sustainability in mind as we face a climate emergency. This talk overviews the sustainability and performance features of Isambard AI, which have been our guiding principles from the design of the data centre down to the individual compute nodes. The Isambard AI exascale platform is deployed in a modular, containerised data centre. Direct liquid-cooled HPE Cray EX cabinets offer maximum power efficiency and a small physical footprint. NVIDIA Grace Hopper GH200 Superchips are optimised for energy-efficient data movement in addition to AI compute horsepower. Overall, we carefully consider the University of Bristol's Net Zero by 2030 target and report on Scope 1, 2, and 3 emissions. The talk will include updates on Isambard AI Phase 1, which went from zero (no data centre) to running AI workloads in less than six months.

Speaker Bio: Dr Sadaf R. Alam is the University of Bristol's Director of Advanced Computing Strategy. Sadaf joined the University of Bristol in 2022 from the Swiss National Supercomputing Centre (CSCS), where she was the Chief Technology Officer (CTO). Dr. Alam studied computer science at the University of Edinburgh, UK, where she received her Ph.D. Until March 2009, she was a computer scientist at Oak Ridge National Laboratory, USA.

Sadaf ensures the end-to-end integrity of HPC systems and storage solutions and leads strategic projects at the centre. She has held several different roles across her career, including group lead of future systems, chief architect, and head of operations. She is a member of ACM, ACM-W, SIGHPC, and Women in HPC, and was the technical chair of the Supercomputing conference SC22.

Sadaf was the chief architect of multiple generations of the Piz Daint supercomputing platform, which was one of Europe's fastest systems and among the top 3 supercomputers in the world for many years, and also the chief architect of MeteoSwiss's innovative, co-designed operational numerical weather forecasting platforms.

2:45PM-3:10PM

Speaker: Gilad Shainer, NVIDIA

Session Chair: TBD

Title: Entering A New Frontier of AI Networking Innovation

Abstract: NVIDIA networking technologies are designed for training AI at scale. In-network computing, high effective bandwidth, and noise-isolation capabilities have facilitated the creation of larger and more complex foundational models. We'll dive deep into the recent technology announcements and their essential roles in next-generation AI data center designs.

Speaker Bio: Gilad Shainer serves as senior vice president of networking at NVIDIA, focusing on high-performance computing and artificial intelligence. He holds multiple patents in the field of high-speed networking. Gilad Shainer holds an M.S. and a B.S. in electrical engineering from the Technion Institute of Technology in Israel.

3:10PM-3:35PM

Speaker: Douglas Fuller, Cornelis Networks

Session Chair: Dhabaleswar K (DK) Panda, The Ohio State University

Title: Software Updates and Ideas for CN5000

Abstract: In late 2024, Cornelis Networks will release its next-generation network fabric, CN5000. This presentation will take a deep dive into the software stack and its integration with MPI via libfabric. We will discuss the changes needed to support additional features as well as the plan for integration with the upstream community. We will also propose some new ideas for how MPI and/or other middleware can take advantage of some of these new features more explicitly if desired.
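For context, the abstract's mention of MPI integration via libfabric refers to the OpenFabrics Interfaces (OFI) user-space API that MPI libraries use to discover and drive a fabric provider. The minimal C sketch below uses standard libfabric calls (fi_getinfo, fi_fabric, fi_domain, fi_endpoint) to open an endpoint the way an MPI-style middleware typically would; the "opx" provider name is an assumption for illustration only and is not a confirmed detail of the CN5000 software stack.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

int main(void)
{
    /* Describe the kind of endpoint an MPI-style middleware typically wants:
     * reliable datagrams with message and RMA capabilities. */
    struct fi_info *hints = fi_allocinfo();
    hints->ep_attr->type = FI_EP_RDM;
    hints->caps = FI_MSG | FI_RMA;
    /* Hypothetical provider name; real deployments would normally let
     * fi_getinfo() select the best available provider. */
    hints->fabric_attr->prov_name = strdup("opx");

    struct fi_info *info = NULL;
    if (fi_getinfo(FI_VERSION(1, 15), NULL, NULL, 0, hints, &info)) {
        fprintf(stderr, "no matching libfabric provider found\n");
        return EXIT_FAILURE;
    }

    /* Open the fabric, a domain (resource/protection scope), and an endpoint. */
    struct fid_fabric *fabric = NULL;
    struct fid_domain *domain = NULL;
    struct fid_ep *ep = NULL;
    fi_fabric(info->fabric_attr, &fabric, NULL);
    fi_domain(fabric, info, &domain, NULL);
    fi_endpoint(domain, info, &ep, NULL);

    printf("opened provider: %s\n", info->fabric_attr->prov_name);

    /* Tear down in reverse order. */
    fi_close(&ep->fid);
    fi_close(&domain->fid);
    fi_close(&fabric->fid);
    fi_freeinfo(info);
    fi_freeinfo(hints);
    return EXIT_SUCCESS;
}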

Speaker Bio: Douglas Fuller is the director of software development at Cornelis Networks. Doug joined Cornelis from Red Hat, where he served as a software engineering manager leading teams working on the Ceph distributed storage system. Doug’s career in HPC has included stints at various universities and Oak Ridge National Laboratory.

Doug holds bachelor's and master's degrees in computer science from Iowa State University. His master's work at DOE Ames Laboratory involved early one-sided communication models in supercomputers. From his undergraduate days, he remains keenly aware of the critical role of floppy diskettes in Beowulf cluster administration.

3:35PM-4:00PM

Speaker: Murali Krishna Emani, Argonne National Laboratory

Session Chair: Hari Subramoni, The Ohio State University

Title: Toward a Holistic Performance Evaluation of Large Language Models Across Diverse AI Accelerators

Abstract: Artificial intelligence (AI) methods have become critical in scientific applications to help accelerate scientific discovery. Large language models (LLMs) are considered a promising approach to address some challenging problems because of their superior generalization capabilities across domains. The effectiveness of the models and the accuracy of the applications are contingent upon their efficient execution on the underlying hardware infrastructure. Specialized AI accelerator hardware systems have recently become available for accelerating AI applications. However, the comparative performance of these AI accelerators on large language models has not previously been studied. In this work, we systematically study LLMs on multiple AI accelerators as well as GPUs and evaluate their performance characteristics for these models. We evaluate these systems with (i) a micro-benchmark using a core transformer block, (ii) a GPT model, and (iii) an LLM-driven science use case, GenSLM. I will present our findings and analyses of the models' performance to better understand the intrinsic capabilities of AI accelerators in this benchmarking effort.

Speaker Bio: Murali Emani is a Computer Scientist in the AI/ML group at the Argonne Leadership Computing Facility (ALCF). Murali obtained a PhD from the University of Edinburgh, UK. His research interests are in scalable machine learning, emerging HPC/AI architectures, and performance optimization and benchmarking. At ALCF, he co-leads the AI Testbed to help evaluate the performance and efficiency of AI accelerators for scientific machine learning applications. He also co-chairs the MLPerf HPC group at MLCommons, which benchmarks large-scale ML on HPC systems.

4:00PM-4:30PM

Break

4:30PM-4:55PM

Speaker: Debendra Das Sharma, Intel

Session Chair: Hari Subramoni, The Ohio State University

Title: Compute Express Link (CXL*): An open interconnect for HPC and AI applications

Abstract: High-performance workloads demand heterogeneous processing, high memory bandwidth, and a load-store based interconnect infrastructure to keep pace with the computing landscape in the era of HPC and generative AI.

CXL is a dynamic multi-protocol interconnect designed to support heterogeneous computing, memory expansion, and a load-store fabric interconnect, with applications in the HPC, AI, Enterprise, and Cloud Computing segments. The CXL consortium is currently working towards the 4th generation of the specification as products based on the first generation are being deployed in data centers. In this talk, we will discuss the progress we have made and the challenges we must meet to continue to be relevant in the HPC and AI space.

Speaker Bio: Dr. Debendra Das Sharma is an Intel Senior Fellow and co-GM of Memory and I/O Technologies in the Data Platforms and Artificial Intelligence Group at Intel Corporation. He is a leading expert on I/O subsystem and interface architecture. He delivers Intel-wide critical interconnect technologies in Peripheral Component Interconnect Express (PCIe), Compute Express Link (CXL), Universal Chiplet Interconnect Express (UCIe), and Intel's coherency interconnect, as well as their implementation.

Dr. Das Sharma is a member of the Board of Directors and treasurer for the PCI Special Interest Group (PCI-SIG). He has been a lead contributor to PCIe specifications since its inception. He is the co-inventor of CXL and a founding member of the CXL consortium. He co-leads the CXL Board Technical Task Force, and is a leading contributor to CXL specifications. He co-invented the chiplet interconnect standard UCIe and is the chair of the UCIe consortium.

Dr. Das Sharma has a Bachelor of Technology (with honors) degree in Computer Science and Engineering from the Indian Institute of Technology, Kharagpur, and a Ph.D. in Computer Engineering from the University of Massachusetts, Amherst. He holds 190+ US patents and 500+ patents worldwide. He is a frequent keynote/plenary speaker, distinguished lecturer, invited speaker, and panelist at the IEEE International Test Conference, IEEE Hot Interconnects, IEEE Cool Chips, IEEE 3DIC, SNIA SDC, PCI-SIG Developers Conference, CXL consortium, Open Server Summit, Open Fabrics Alliance, Flash Memory Summit, Intel Innovation, and universities (CMU, Texas A&M, Georgia Tech, UIUC, UC Irvine). He received the Distinguished Alumnus Award from the Indian Institute of Technology, Kharagpur in 2019, the IEEE Region 6 Outstanding Engineer Award in 2021, the first PCI-SIG Lifetime Contribution Award in 2022, the IEEE Circuits and Systems Industrial Pioneer Award in 2022, and the IEEE Computer Society Edward J. McCluskey Technical Achievement Award in 2024.

4:55PM-5:20PM

Speaker: Shrijeet Mukherjee, Enfabrica

Session Chair: Hari Subramoni, The Ohio State University

Title: Foundational Networking Silicon for the Accelerated Computing Era

Abstract: The past two decades of data center networking have been dominated by the expansion of scale-out design principles. The goal was to create large domains of compute built out of simple, small, homogeneous units of CPUs and associated I/O with a contained blast radius. The inefficiency of distributing all computing into these uniform shards was compensated for by overprovisioning the number of shards and prioritizing stability over efficiency. Other than making each compute "stovepipe" faster, the remaining problem was to perform efficient shard distribution, which is where most of the past decade-plus of effort has been focused (e.g., SmartNICs).

Fast forward to 2024: data centers have been overrun by an explosion in artificial intelligence and accelerated computing workloads, forcing a rethink of the networking designs that dominated the previous decade. Enfabrica is designing a new networking architecture, manifested in the Accelerated Compute Fabric (ACF) SuperNIC chip, purpose-built to address the needs of high-performance, heterogeneous, distributed computing, where bandwidth and latency are precious resources and stranding them is neither technically nor economically viable.

The talk will cover the basic components of the ACF superNIC design and the system-level innovations and benefits of this architecture over conventional networking designs.

Speaker Bio: Shrijeet is Co-Founder and Chief Technology Officer of Enfabrica, where he oversees a new class of technology aimed at addressing the needs of modern heterogeneous and composable computer architectures. Prior to founding Enfabrica, he spent over three decades building large distributed computing systems. He started out building large NUMA graphics systems during his tenure at SGI, where he played a pivotal role in the development of the first floating-point GPU before it was defined as such. Shrijeet led the NIC and virtualization groups for the Cisco Unified Computing System and spearheaded the development of the SmartNIC, later recognized as the DPU. Later, he served as VP of Engineering at Cumulus Networks, shepherding the open networking revolution and building high-performance routers and switches.

Shrijeet is on the Linux NetDev Society Board of Directors and has over 40 patents. He holds an MS in Computer Science from the University of Oregon.

5:20PM-6:00PM

Panel Moderator: Nectarios Koziris, National Technical University of Athens, Greece

Title: Do we need special-purpose networking technologies for handling AI workloads?

Summary:

Moderator Bio: Nectarios Koziris is a Professor of Computer Science and the Dean of the School of Electrical and Computer Engineering at the National Technical University of Athens. His research interests include parallel and distributed systems, the interaction between compilers, OS, and architectures, datacenter hyperconvergence, scalable data management, and large-scale storage systems. He has co-authored more than 180 research papers with more than 4,800 citations (h-index: 30). Since 1998 he has been involved in the organization of many international scientific conferences, including IPDPS, ICPP, SC, SPAA, etc. He has given many invited talks at conferences and universities in Europe and the USA. He is a recipient of two best paper awards for his research in parallel and distributed computing (IEEE/ACM IPDPS 2001 and CCGRID 2013) and received honorary recognition from Intel (2015) for his research and insightful contributions in transactional memory (TSX synchronization extensions). Nectarios has served as the Vice-Chair of the Greek Research and Technology Network (GRNET). He was the founder of the ~okeanos project, a public cloud IaaS infrastructure that was among the biggest in the European public sector (topping out beyond 10,000 active VMs), powered by the open-source Synnefo software. He is an advisor to Arrikto Inc., a startup based in Palo Alto, California, that develops storage intelligence to access, manage, and store data in large-scale, heterogeneous, and hybrid environments.

Panelists:

6:00PM-6:05PM

Closing Remarks

Hari Subramoni, Aamir Shafi, and Dhabaleswar K (DK) Panda, The Ohio State University