Aurora is an exascale supercomputer in the final stages of assembly at the Argonne Leadership Computing Facility (ALCF) in the U.S. This talk will focus on the Aurora hardware and software architectures with emphasis on the interconnect and programming models, and their impact on application performance and scalability.
Dr Kalyan Kumaran is a Senior Computer Scientist and Director of Technology at the Argonne Leadership Computing Facility. He leads the Non-Recurring Engineering (NRE) collaboration with Intel to develop the hardware and software stack for Aurora, Argonne’s first exascale computer. He is an expert on performance-related activities and one of the lead architects from Argonne for their recent systems.
GPU-centric accelerated supercomputing is still the mainstream for HPC and AI applications. However, for next-generation systems, we need to consider a wider variety of accelerators in different styles of systems, starting from the architecture level. On such complicated systems, what is the best way to program while balancing programmability/productivity and performance? We have been working on a multi-hetero accelerated environment that combines GPUs and FPGAs in a single platform, so that complicated multiphysics applications can make 360-degree use of the accelerating devices. There are several approaches, from a naive implementation to a high-level directive-based approach. In this talk, I will present the programming model, the supporting language system, and target applications with their implementation on a real system.
Taisuke Boku has been researching HPC system architecture, system software, and performance evaluation on various scientific applications since he received his PhD in Electrical Engineering from Keio University, Japan. He is currently the director of the Center for Computational Sciences, University of Tsukuba, a co-design center with both application researchers and HPC system researchers. He has played a central role in the development of original supercomputers at the center, including CP-PACS (ranked number one in the TOP500 in 1996), FIRST, PACS-CS, HA-PACS and Cygnus, all representative supercomputers in Japan. The recent system Cygnus is the world's first multi-hybrid accelerated system with GPUs and FPGAs together. He was the President of the HPCI (High Performance Computing Infrastructure) Consortium in Japan from 2020 to 2022. He was a member of the system architecture working group for the Fugaku supercomputer development. He received the ACM Gordon Bell Prize in 2011.
Modern extreme scale computing systems rely on heterogeneous CPU and GPU architectures. While this design has enabled several remarkable achievements in high-performance computing, applications running at exascale have already identified multiple opportunities where this paradigm can be improved; notably, the communication costs, and the complexity of the resultant programming model, incurred by the presence of two isolated memory spaces for CPU and GPU. To address these challenges, AMD has developed the Instinct MI300 APU (Accelerated Processing Unit) architecture, which integrates CPU and GPU processing elements on the same system on a chip (SoC). This talk will discuss programmability advantages, and future possibilities, afforded by the MI300 for Exascale computing, including: the improved simplicity of porting from CPU codes and performance benefits resulting from close integration of CPU and GPU compute elements. These simplifications and improvements are realized in a variety of tools, including the RAJA and Kokkos accelerator abstraction frameworks, a recently developed Standard Parallelism interface to AMD APUs, and automatic offload of libraries.
Nicholas Malaya is a Fellow at AMD, where he is AMD's technical lead for exascale application performance. Nick's research interests include HPC, computational fluid dynamics, Bayesian inference, and ML/AI. He received his PhD from the University of Texas. Before that, he double majored in Physics and Mathematics at Georgetown University, where he received the Treado medal. In his copious spare time he enjoys motorcycles, long distance running, wine, and spending time with his wife and children.
Shuaiwen Leon Song is a senior principal scientist and manager at Microsoft. He leads the DeepSpeed4Science initiative, which aims to create broad engagement between Microsoft, Microsoft Research, DOE labs, academia, and industry partners to enable sophisticated system technology research and development in support of training and inference for large-scale AI-driven scientific models. At DeepSpeed, he also drives or co-drives several pathfinding projects and releases (e.g., ZeRO inference, scalable dialogue system design, and DeepSpeed Chat) and co-manages the Brainwave team. Prior to Microsoft, he was the SOAR associate professor at the University of Sydney and an adjunct professor at the University of Washington. His past work in HPC has received several best paper nominations and has been featured in U.S. DOE research highlights and other media outlets. He has received several awards, including the IEEE early-career award for HPC, the IEEE mid-career award for scalable computing, a Facebook faculty award, a Google Brain faculty award, the Australian most innovative engineer award, and an AIR global faculty award. He is also an ACM distinguished speaker.
Moore’s Law is a techno-economic model that has enabled the IT industry to double the performance and functionality of digital electronics roughly every 2 years within a fixed cost, power and area. This expectation has led to a relatively stable ecosystem (e.g. electronic design automation tools, compilers, simulators and emulators) built around general-purpose processor technologies, such as the x86, ARM and Power instruction set architectures. However, the historical improvements in performance offered by successive generations of lithography are waning while costs for new chip generations are growing rapidly. In the near term, the most practical path to continued performance growth will be architectural specialization in the form of many different kinds of accelerators. New software implementations, and in many cases new mathematical models and algorithmic approaches, are necessary to advance the science that can be done with these specialized architectures. This trend will not only continue but also intensify, as the transition from multi-core systems to hybrid systems has already caused many teams to refactor and redesign their implementations. But the next step, to systems that exploit not just one type of accelerator but a full range of heterogeneous architectures, will require more fundamental and disruptive changes in algorithm and software approaches. This applies to the broad range of algorithms used in simulation, data analysis and learning. New programming models or low-level software constructs that hide the details of the architecture from the implementation can make future programming less time-consuming, but they will not eliminate, nor in many cases even mitigate, the need to redesign algorithms. Future software development will not be tractable if a completely different code base is required for each different variant of a specialized system.
The aspirational desire for “minimizing the number of lines of code that must be changed to migrate to different systems with different arrangements of specialization” is encapsulated in the loaded phrase “Performance Portability.” However, performance portability is likely not an achievable goal if we attempt to do it using imperative languages like Fortran and C/C++. There is simply not enough flexibility built into the specification of the algorithm for a compiler to do anything other than what the algorithm designer explicitly stated in their code. Making this future of diverse accelerators usable and accessible will require the co-design of new compiler technology and domain-specific languages (DSLs) designed around the requirements of the target computational motifs. The higher levels of abstraction and declarative semantics offered by DSLs enable more degrees of freedom to optimally map the algorithms onto diverse hardware than traditional imperative languages that over-prescribe the solution. Because this will drastically increase the complexity of the mapping problem, new mathematics for optimization will be developed, along with better performance introspection (both hardware and software mechanisms for online performance introspection) through extensions to the roofline model. Use of ML/AI technologies will be essential to enable analysis and automation of dynamic optimizations.
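For readers unfamiliar with the roofline model mentioned in the abstract above, the following is a minimal sketch of the classic (unextended) bound: attainable performance is the smaller of peak compute and memory bandwidth multiplied by arithmetic intensity. The function names and the machine numbers in the comments are hypothetical, chosen only for illustration; they do not come from the talk.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, arithmetic_intensity):
    """Classic roofline bound.

    peak_gflops          -- peak compute rate (GFLOP/s)
    bandwidth_gbs        -- peak memory bandwidth (GB/s)
    arithmetic_intensity -- FLOPs performed per byte moved (FLOP/byte)
    """
    return min(peak_gflops, bandwidth_gbs * arithmetic_intensity)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity at which a kernel stops being bandwidth-bound."""
    return peak_gflops / bandwidth_gbs


if __name__ == "__main__":
    # Hypothetical machine: 1000 GFLOP/s peak, 100 GB/s memory bandwidth.
    for ai in (0.5, 2.0, 10.0, 50.0):
        print(ai, attainable_gflops(1000.0, 100.0, ai))
```

Kernels whose arithmetic intensity falls below the ridge point are limited by memory bandwidth, and those above it are limited by peak compute.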
John Shalf is Department Head for Computer Science at Lawrence Berkeley National Laboratory, and recently was deputy director of Hardware Technology for the DOE Exascale Computing Project. Shalf is a coauthor of over 80 publications in the field of parallel computing software and HPC technology, including three best papers and the widely cited report “The Landscape of Parallel Computing Research: A View from Berkeley” (with David Patterson and others). He also coauthored the 2008 “ExaScale Software Study: Software Challenges in Extreme Scale Systems,” which set the Defense Advanced Research Projects Agency’s (DARPA’s) information technology research investment strategy. Prior to coming to Berkeley Laboratory, John worked at the National Center for Supercomputing Applications and the Max Planck Institute for Gravitational Physics/Albert Einstein Institute (AEI), where he was co-creator of the Cactus Computational Toolkit.
In this talk, I will discuss the multi-stream based execution environment of Habana/Gaudi systems that is exposed to deep learning frameworks, and I will show how one can combine compute, networking and DMA at high performance and with low run-time overheads. I will highlight the performance of the Habana Collective Communication Library at scale in terms of bandwidth and message rate, and demonstrate its impact on the deep learning training and inference performance of a few neural network models, including vision models and Large Language Models. In the second part of the talk, I will highlight the challenges in communication scaling, especially the associated congestion that we observe between leaf and spine switches under certain conditions. I will describe solutions that we are currently deploying, including congestion control algorithms and packet/message spraying techniques at the endpoint, and share our results.
Karthikeyan Vaidyanathan (Karthik) is a Principal AI Engineer at Intel. His responsibilities include delivering the best application performance in large-scale datacenters, defining and optimizing collective communication algorithms, coming up with novel framework-level optimizations, and co-designing hardware/software for future scale-out systems. He has made significant contributions to MLPerf submissions and to large-scale (up to ~10,000 nodes) Top500 and Green500 runs, enabling Intel to achieve a #1 ranking. He is an Intel Achievement Awardee, a recipient of the Intel Labs Gordy Award, and an author of several top-tier conference papers and patents. He received his Ph.D. from The Ohio State University, USA. Karthik is also an adjunct faculty member at IIIT Bangalore, teaching a course on network-based computing for HPC and Deep Learning.
In conventional multi-GPU configurations, the host manages execution, kernel launches, communication, and synchronization, incurring unnecessary overhead. To mitigate this, we present a CPU-free model that delegates control to the devices themselves, especially benefiting communication-intensive applications. Utilizing techniques such as persistent kernels, specialized thread blocks, and device-initiated communication, we create autonomous multi-GPU code that drastically reduces communication overhead. Our approach is demonstrated with popular solvers, including 2D/3D Jacobi stencil and Conjugate Gradient (CG). We are currently developing compiler technology and debugging/profiling tools for the model, and applying it to a broader set of applications.
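As background for the stencil solver named in the abstract above, here is a minimal single-process sketch of a 2D Jacobi-style sweep, in which each interior point is replaced by the average of its four neighbours. It only illustrates the numerical kernel and is not the CPU-free multi-GPU implementation described in the talk. In a multi-GPU version, each device would typically own a subgrid and exchange boundary (halo) rows with its neighbours every iteration, which is the communication the abstract aims to make cheaper. The function names are hypothetical.

```python
def jacobi_sweep(grid):
    """Return a new grid after one Jacobi sweep; boundary values stay fixed."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # copy so reads always use old values
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (
                grid[i - 1][j] + grid[i + 1][j] + grid[i][j - 1] + grid[i][j + 1]
            )
    return new


def jacobi(grid, iterations):
    """Apply a fixed number of Jacobi sweeps."""
    for _ in range(iterations):
        grid = jacobi_sweep(grid)
    return grid


if __name__ == "__main__":
    # Hypothetical 4x4 problem: boundary held at 1.0, interior starts at 0.0.
    g = [[1.0] * 4 for _ in range(4)]
    for i in (1, 2):
        for j in (1, 2):
            g[i][j] = 0.0
    for row in jacobi(g, 10):
        print(row)
```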
Didem Unat is a faculty member at Koç University and director of the Parallel and Multicore Computing Laboratory. She is the first researcher from Turkey to receive funding from the European Research Council (ERC) in the field of Computer Science, for her project BEYONDMOORE for the 2021-2026 period. She is currently acting as the project coordinator of a 2.6M € EuroHPC partnered project. She was named the “Emerging Woman Leader in Technical Computing” by ACM SIGHPC in 2021, the first recipient of this award outside the US.
She is known for her work on programming models, performance tools, and system software for emerging parallel architectures. She received her PhD degree from the University of California, San Diego, and later the Luis Alvarez Postdoctoral Fellowship from Lawrence Berkeley National Laboratory. She received the Marie Sklodowska-Curie Individual Fellowship from the European Commission in 2015, the BAGEP Award from the Turkish Academy of Sciences in 2019, the British Royal Society Newton Advanced Fellowship in 2020, and the 2021 Scientist of the Year – Young Scientist Awards by Bilim Kahramanları Derneği in Turkey.
Ensuring high productivity in scientific software development necessitates developing and maintaining a single codebase that can run efficiently on a range of accelerator-based supercomputing platforms. This requires the use of performance portability layers such as OpenMP, RAJA, Kokkos and SYCL for developing the compute kernels. In this talk, I will present the results of a comprehensive study of a range of proxy applications implemented in the major programming models suitable for GPU-based platforms. We collect and analyze performance results across NVIDIA and AMD GPU hardware currently deployed in leadership-class computing facilities using a representative set of scientific codes and several programming models -- CUDA, HIP, Kokkos, RAJA, OpenMP, OpenACC, and SYCL. Based on the specific characteristics of applications tested, we discuss recommendations to developers on how to choose the right programming model for their code. These results provide a comprehensive evaluation of the extent to which each programming model for heterogeneous systems provides true performance portability in real-world usage.
Abhinav Bhatele is an associate professor in the department of computer science, and director of the Parallel Software and Systems Group at the University of Maryland, College Park. His research interests are broadly in systems and networks, with a focus on parallel computing and large-scale data analytics. He has published research in parallel programming models and runtimes, network design and simulation, applications of machine learning to parallel systems, parallel deep learning, and on analyzing/visualizing, modeling and optimizing the performance of parallel software and systems. Abhinav has received best paper awards at Euro-Par 2009, IPDPS 2013 and IPDPS 2016. He was selected as a recipient of the IEEE TCSC Young Achievers in Scalable Computing award in 2014, the LLNL Early and Mid-Career Recognition award in 2018, and the NSF CAREER award in 2021.
Programming heterogeneous computing systems is a daunting task which is becoming even more challenging with the advent of emerging, non-von Neumann computer architectures. Innovations in programming abstractions and compilers are thus badly needed to cope with the current golden age of computer architecture. This talk discusses domain-specific abstractions and languages as a promising avenue to hide the system complexity from non-expert programmers while passing richer information to compilers. The high-level semantics in DSLs improves productivity while enabling coarser-grained optimization and safer code generation. Examples are provided from the domains of big data, physics simulations and machine learning, targeting modern reconfigurable hardware, emerging memory technologies, and emerging in-memory computing.
Jeronimo Castrillon is a professor in the Department of Computer Science at the TU Dresden, where he is also affiliated with the Center for Advancing Electronics Dresden (CfAED). He is the head of the Chair for Compiler Construction, with a research focus on methodologies, languages, tools and algorithms for programming complex computing systems. He received the Electronics Engineering degree from the Pontificia Bolivariana University in Colombia in 2004, his master's degree from the ALaRI Institute in Switzerland in 2006 and his Ph.D. degree (Dr.-Ing.) with honors from the RWTH Aachen University in Germany in 2013. In 2014, Prof. Castrillon co-founded Silexica GmbH/Inc, a company that provides programming tools for heterogeneous architectures, now with Xilinx/AMD.
In this panel, we focus on the challenges in programming models and runtime systems for large language model training and inference. We invite researchers across academia, national labs, and industry to share their experience and vision on programming tools, runtime performance, architecture, optimization, scalability, I/O, data, and communication to facilitate LLMs on supercomputers. The discussion will cover LLM pretraining, fine-tuning, deployment, and usage in science. We will identify the Top 5 challenges across these areas.
Dr. Zhao Zhang is an assistant professor in the Department of Electrical and Computer Engineering at Rutgers University. He has extensive experience in high performance computing (HPC) and big data systems. His recent research focuses on the fusion of HPC and deep learning (DL), spanning a wide range of topics including optimization algorithms, I/O, architecture, and domain applications. Dr. Zhang co-leads the CI4AI thrust in the NSF ICICLE AI Institute. His research in scalable neural network optimization and AI cyberinfrastructure is funded by multiple NSF awards.
Torsten Hoefler, ETH Zurich
Leon Song, Microsoft
Rick Stevens, University of Chicago
Rio Yokota, Tokyo Institute of Technology, Japan
Zhao Zhang, Rutgers University (Moderator)