Programming exascale systems was seen as a major challenge at the start of the efforts to reach that level of performance. Perhaps not surprisingly, despite predictions of the likely dominance of new languages, users of DOE exascale systems still rely heavily on the MPI + OpenMP model that has dominated HPC for many years. Even emerging C++ abstraction layers such as Kokkos and RAJA often use the familiar MPI + OpenMP model in their backends. Thus, this talk will describe the implementation of the MPI + OpenMP model on the El Capitan and Frontier DOE exascale systems, as well as how OpenMP has evolved, and will continue to evolve, to remain a key part of the large-scale programming ecosystem.
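For readers unfamiliar with the hybrid style, the following is a minimal sketch of how the model divides work: MPI distributes data across ranks while OpenMP parallelizes within each rank. It is an illustration of the model only, not code from the talk.

```cpp
// Minimal MPI + OpenMP hybrid sketch; illustrative only.
#include <mpi.h>
#include <omp.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    int provided;
    // Request thread support so OpenMP threads could make MPI calls if needed.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;               // local portion owned by this rank
    std::vector<double> x(n, 1.0);
    double local_sum = 0.0;

    // On-node parallelism via OpenMP threads; on GPU nodes this loop would
    // typically use "#pragma omp target teams distribute parallel for".
    #pragma omp parallel for reduction(+ : local_sum)
    for (int i = 0; i < n; ++i)
        local_sum += x[i];

    // Inter-node communication via MPI.
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("sum = %f across %d ranks x %d threads\n",
                    global_sum, size, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}
```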
As Chief Technology Officer (CTO) for Livermore Computing (LC) at Lawrence Livermore National Laboratory (LLNL), Bronis R. de Supinski formulates LLNL's large-scale computing strategy and oversees its implementation. He frequently interacts with supercomputing leaders and oversees many collaborations with industry and academia. Previously, Bronis led several research projects in LLNL's Center for Applied Scientific Computing. He actively continues to engage in research, particularly in parallel programming models, and serves as the Chair of the OpenMP Language Committee. He earned his Ph.D. in Computer Science from the University of Virginia in 1998 and joined LLNL in July 1998. In addition to his work with LLNL, Bronis is a Professor of Exascale Computing at Queen's University Belfast and an Adjunct Associate Professor in the Department of Computer Science and Engineering at Texas A&M University. Throughout his career, Bronis has won several awards, including the prestigious Gordon Bell Prize in 2005 and 2006, as well as two R&D 100 Awards. He is an IEEE Fellow.
The Message Passing Interface (MPI) is the dominant programming model on HPC systems and has been instrumental in developing efficient, large-scale parallel applications. However, it has a rather static view of compute resources, built on top of the concept of immutable communicators. While this provides ease of use and simplicity, it is limiting, in particular for modern workflow-based workloads as well as in its support for resource-adaptive systems. The newly introduced concept of MPI Sessions, however, opens the door to more dynamicity and adaptivity. In this talk I will highlight the opportunities that can arise from such directions and discuss novel approaches we are pursuing as part of several EuroHPC projects. Our ultimate goal is to provide full malleability in MPI as well as the surrounding software layers - from system software to applications - and with that enable us to more efficiently harness the computational capabilities of current and future HPC systems.
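To make the Sessions concept concrete, here is a minimal sketch using the standard MPI-4 Sessions calls (the tag string is arbitrary; this is an illustration, not code from the EuroHPC projects): a communicator is derived from a named process set at runtime rather than from a static MPI_COMM_WORLD.

```cpp
// MPI Sessions sketch: derive a communicator from a named process set.
#include <mpi.h>
#include <cstdio>

int main() {
    MPI_Session session;
    MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_RETURN, &session);

    // Build a group from the standard "world" process set; a malleable
    // runtime could expose additional, changing process sets here.
    MPI_Group group;
    MPI_Group_from_session_pset(session, "mpi://WORLD", &group);

    MPI_Comm comm;
    MPI_Comm_create_from_group(group, "org.example.sessions-demo",
                               MPI_INFO_NULL, MPI_ERRORS_RETURN, &comm);

    int rank;
    MPI_Comm_rank(comm, &rank);
    if (rank == 0) std::printf("communicator created from a session pset\n");

    MPI_Group_free(&group);
    MPI_Comm_free(&comm);
    MPI_Session_finalize(&session);
    return 0;
}
```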
Martin Schulz is a Full Professor and Chair for Computer Architecture and Parallel Systems at the Technische Universität München (TUM), which he joined in 2017, as well as a member of the board of directors at the Leibniz Supercomputing Centre. Prior to that, he held positions at the Center for Applied Scientific Computing (CASC) at Lawrence Livermore National Laboratory (LLNL) and Cornell University. He earned his Doctorate in Computer Science in 2001 from TUM and a Master of Science in Computer Science from UIUC. Martin's research interests include parallel and distributed architectures and applications; performance monitoring, modeling and analysis; memory system optimization; parallel programming paradigms; tool support for parallel programming; power-aware parallel computing; and fault tolerance at the application and system level, as well as quantum computing and quantum computing architectures and programming, with a special focus on HPC and QC integration.
Increasing workload fidelity and achieving faster time to solution have required the deployment of the world's first exascale systems. However, the scale of these systems presents programming challenges due to greatly increased parallelism and heterogeneity. This talk details early performance results at scale on systems such as Frontier, using a variety of programming models across HPC and machine learning, such as MPI and RCCL. We conclude with a discussion of the significance and impact of programming models on applications with GPUs/accelerators in the post-exascale era.
Nicholas Malaya is a Principal Engineer at AMD, where he is AMD's technical lead for exascale application performance. Nick's research interests include HPC, computational fluid dynamics, Bayesian inference, and ML/AI. He received his PhD from the University of Texas. Before that, he double majored in Physics and Mathematics at Georgetown University, where he received the Treado medal. In his copious spare time he enjoys motorcycles, long distance running, wine, and spending time with his wife and children.
The move to larger, more powerful compute nodes on large-scale HPC systems has been significant in recent years. It's not uncommon for nodes now to have 128+ computational cores and a significant amount of GPU resources. This provides potential scope for active middleware to run on these nodes, managing anything from storage and I/O to compute kernels and network traffic. However, there needs to be a stronger understanding of the impact of on-node workloads on application performance, especially when we are aiming to scale to exascale systems with many millions of workers. In this talk I will discuss work we are doing to evaluate and characterise the impact of on-node workloads, and explore some of the active middleware that could enable scaling up to very large node and system sizes without requiring significant user application changes.
Adrian Jackson is a senior research fellow at EPCC, The University of Edinburgh. He has a long history of research in HPC, from optimising applications for specific hardware through to designing and developing middleware for new systems. He has spent recent years researching non-volatile memory and advanced data storage systems, and has also been heavily involved in hardware optimisation projects, such as the UK's ExCALIBUR FPGA testbed. Research experience across a range of different areas in high performance computing provides the basis for his current work on compute node performance capacity and automatic communication libraries designed to exploit network capacity for improved application performance.
Parallel programming for extreme scale computing is hard. Couple that with heterogeneous processors across the system and it becomes even harder. Add to the mix that modern programmers are not being trained to understand how algorithms map onto the features of hardware, and it becomes harder still. Throw in that software outlives hardware so a single codebase must work across a wide range of different systems, and we arrive at programming challenges at an extreme scale. In this talk we will propose pragmatic solutions to these challenges; solutions that will support high programmer productivity to generate codebases that are performant and portable.
Tim Mattson is a parallel programmer obsessed with every variety of science (Ph.D. Chemistry, UCSC, 1985). He is a senior principal engineer in Intel’s parallel computing lab. Tim has been with Intel since 1993 and has worked with brilliant people on great projects including: (1) the first TFLOP computer (ASCI Red), (2) MPI, OpenMP and OpenCL, (3) two different research processors (Intel's TFLOP chip and the 48-core SCC), (4) data management systems (polystore systems and array-based storage engines), and (5) the GraphBLAS API for expressing graph algorithms as sparse linear algebra. Tim has over 150 publications, including five books on different aspects of parallel computing, the latest of which (published November 2019) is titled “The OpenMP Common Core: Making OpenMP Simple Again”.
Sparse linear algebra routines are fundamental building blocks of a large variety of scientific applications. Direct solvers, which are methods for solving linear systems via the factorization of matrices into products of triangular matrices, are commonly used in many contexts. The Cholesky factorization is the fastest direct method for symmetric positive definite matrices. This paper presents selective nesting, a method to determine the optimal task granularity for the parallel Cholesky factorization based on the structure of sparse matrices. We propose the OPT-D algorithm, which automatically and dynamically applies selective nesting. OPT-D leverages matrix sparsity to drive complex task-based parallel workloads in the context of direct solvers. We run an extensive evaluation campaign considering a heterogeneous set of 35 sparse matrices and a parallel machine featuring the A64FX processor. OPT-D delivers an average performance speedup of 1.46x with respect to the best state-of-the-art parallel method to run direct solvers.
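To make the task-granularity question concrete, the following is a minimal dense, tiled Cholesky with OpenMP task dependences. It illustrates the class of task-based factorizations that OPT-D targets, not the paper's sparse implementation; the tile size B stands in for the granularity being tuned, and nesting further tasks inside the kernels is the choice selective nesting automates.

```cpp
// Tiled task-parallel Cholesky sketch (dense, naive kernels; illustrative).
#include <omp.h>
#include <cmath>

constexpr int B = 64;  // tile size, i.e. the task granularity being tuned

// Unblocked Cholesky of a diagonal tile: a = L * L^T, lower triangle.
static void potrf(double* a) {
    for (int k = 0; k < B; ++k) {
        a[k*B + k] = std::sqrt(a[k*B + k]);
        for (int i = k + 1; i < B; ++i) a[i*B + k] /= a[k*B + k];
        for (int j = k + 1; j < B; ++j)
            for (int i = j; i < B; ++i) a[i*B + j] -= a[i*B + k] * a[j*B + k];
    }
}

// Triangular solve for a sub-diagonal tile: a <- a * L^-T, L from potrf.
static void trsm(const double* l, double* a) {
    for (int i = 0; i < B; ++i)
        for (int j = 0; j < B; ++j) {
            double s = a[i*B + j];
            for (int k = 0; k < j; ++k) s -= a[i*B + k] * l[j*B + k];
            a[i*B + j] = s / l[j*B + j];
        }
}

// Trailing update: c <- c - a * b^T.
static void update(const double* a, const double* b, double* c) {
    for (int i = 0; i < B; ++i)
        for (int j = 0; j < B; ++j) {
            double s = 0.0;
            for (int k = 0; k < B; ++k) s += a[i*B + k] * b[j*B + k];
            c[i*B + j] -= s;
        }
}

// tiles[i*nt + j] points to the row-major B*B tile (i,j); lower triangle.
void cholesky(double** tiles, int nt) {
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < nt; ++k) {
        #pragma omp task depend(inout: tiles[k*nt + k][0])
        potrf(tiles[k*nt + k]);

        for (int i = k + 1; i < nt; ++i) {
            #pragma omp task depend(in: tiles[k*nt + k][0]) \
                             depend(inout: tiles[i*nt + k][0])
            trsm(tiles[k*nt + k], tiles[i*nt + k]);
        }
        for (int i = k + 1; i < nt; ++i) {
            for (int j = k + 1; j <= i; ++j) {
                #pragma omp task \
                    depend(in: tiles[i*nt + k][0], tiles[j*nt + k][0]) \
                    depend(inout: tiles[i*nt + j][0])
                update(tiles[i*nt + k], tiles[j*nt + k], tiles[i*nt + j]);
            }
        }
    }
}
```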
Octo-Tiger, a large-scale 3D AMR code for the merger of stars, uses a combination of HPX, Kokkos and explicit SIMD types, aiming to achieve performance portability for a broad range of heterogeneous hardware. However, on A64FX CPUs, we encountered several missing pieces that hindered performance by causing problems with SIMD vectorization. We therefore add std::experimental::simd as an option to use in Octo-Tiger's Kokkos kernels alongside Kokkos SIMD, and further add a new SVE (Scalable Vector Extension) SIMD backend. Additionally, we supply missing SIMD implementations in the Kokkos kernels within Octo-Tiger's hydro solver. We test our changes by running Octo-Tiger on three different CPUs: an A64FX, an Intel Icelake and an AMD EPYC CPU, evaluating SIMD speedup and node-level performance. We observe a good SIMD speedup on the A64FX CPU, as well as noticeable speedups on the other two CPU platforms. However, we also encounter a scaling issue on the EPYC CPU.
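As a small illustration of the std::experimental::simd style in question (a hypothetical axpy kernel, not one of Octo-Tiger's), the explicit vector type takes the width of the target ISA, e.g. SVE lanes on A64FX or AVX-512 lanes on Icelake, so the same source vectorizes portably:

```cpp
// Explicit SIMD via std::experimental::simd (Parallelism TS v2).
#include <experimental/simd>
#include <cstddef>

namespace stdx = std::experimental;
using simd_t = stdx::native_simd<double>;  // width follows the target ISA

// y[i] = a * x[i] + y[i]; for brevity, n is assumed to be a multiple of
// simd_t::size() (a real kernel would handle the remainder).
void axpy(double a, const double* x, double* y, std::size_t n) {
    for (std::size_t i = 0; i < n; i += simd_t::size()) {
        simd_t vx(x + i, stdx::element_aligned);   // vector load of x
        simd_t vy(y + i, stdx::element_aligned);   // vector load of y
        vy = a * vx + vy;                          // elementwise multiply-add
        vy.copy_to(y + i, stdx::element_aligned);  // vector store of y
    }
}
```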
APEX (Autonomic Performance Environment for eXascale) is a performance measurement library for distributed, asynchronous multitasking runtime systems. It provides support for both lightweight measurement and high concurrency. To support performance measurement in systems that employ user-level threading, APEX uses a dependency chain in addition to the call stack to produce traces and task dependency graphs. APEX also provides a runtime adaptation system based on observed system performance. In this paper, we describe the evolution of APEX from its original design for HPX to its support for an array of programming models and abstraction layers, and describe some of the features that have evolved to help understand the asynchrony and high concurrency of asynchronous tasking models.
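The idea behind the dependency chain can be sketched in a few lines: because a user-level task may suspend and resume on a different OS thread, the OS call stack does not reflect a task's logical ancestry, so each task carries an explicit parent link. The following is an illustrative sketch of that idea only, not APEX's actual API.

```cpp
// Dependency-chain sketch for profiling user-level tasks; illustrative only.
#include <cstdio>
#include <memory>
#include <string>

struct Task {
    std::string name;
    std::shared_ptr<Task> parent;  // logical predecessor, not a stack frame
};

// The task currently running on this worker; a user-level scheduler would
// update this on every context switch (hypothetical hook).
thread_local std::shared_ptr<Task> current_task;

std::shared_ptr<Task> spawn(std::string name) {
    // The new task's parent is whichever task created it, preserved even if
    // the task later resumes on a different OS thread.
    return std::make_shared<Task>(Task{std::move(name), current_task});
}

void print_chain(const std::shared_ptr<Task>& t) {
    for (auto p = t; p; p = p->parent) std::printf("%s <- ", p->name.c_str());
    std::printf("(root)\n");
}
```

Walking the parent links instead of the stack yields the task dependency graph that the trace output records.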
Artificial Intelligence (AI) enhances the speed, precision, and effectiveness of applications and simulations in many fields, including scientific applications and large-scale HPC simulations and models. Recently, researchers have attempted to solve problems related to High-Performance Computing and Cyberinfrastructure, such as Scheduling and Resource Management, Device Mapping and Autotuning, Code Optimization and Compilers, Code Generation and Translation, etc., using AI and specifically Deep Learning. However, a major challenge of this type of research is that Deep Learning methods usually need large datasets, and, unlike in other fields, comparatively few datasets are available for these tasks. Another major challenge of data-driven HPC research is the representation of the data or code. For example, some primary research questions on data-driven Code and Compiler Optimization remain unanswered: “Can there be a UNIVERSAL REPRESENTATION for code that will perform well for all tasks, or do we need different representations for multiple optimizations? Can DL models learn ENOUGH without any dynamic or profiling information? Can DL models learn from all the IMBALANCED and mostly UNLABELED data?” This panel aims to identify and discuss the challenges and opportunities for applying Deep Learning to HPC. It presents a stimulating environment where the community can discuss topics relevant to HPC and AI. The panel intends to initiate research collaborations and to provide an opportunity to receive feedback and opinions from domain experts and discover new ideas, directions, and potential solutions in data-driven HPC research.
Ali Jannesari is an Assistant Professor with the Computer Science Department at Iowa State University. He is the Director of the Software Analytics and Pervasive Parallelism Lab at ISU. His research primarily focuses on the intersection of high-performance computing (HPC) and data science. Prior to joining the faculty at ISU, he was a Senior Research Fellow at the University of California, Berkeley. He was in charge of the Multicore Programming Group at the Technical University of Darmstadt and a junior research group leader at RWTH Aachen University. He worked as a PostDoc fellow at Karlsruhe Institute of Technology and Bosch Research Center, Munich. Jannesari has published more than seventy refereed articles, several of which have received awards. He has received research funding from multiple European and US funding agencies. He holds a Habilitation degree from TU Darmstadt and received his Ph.D. degree in Computer Science from Karlsruhe Institute of Technology.
Ali Jannesari, Iowa State University (Moderator)
Vipin Chaudhary, Case Western Reserve University
Mary Hall, University of Utah
Torsten Hoefler, ETH Zurich, Switzerland
Dong Li, University of California, Merced