(Virtual Workshop this year)
The fifth edition of the ESPM2 workshop, proposed as a full-day meeting held with the Supercomputing (SC20) conference in Atlanta, Georgia, focuses on programming models and runtimes for extreme-scale systems. Next-generation architectures and systems being deployed are characterized by high concurrency, low memory per core, and multiple levels of hierarchy and heterogeneity. These characteristics pose new challenges in energy efficiency, fault tolerance, and scalability. It is commonly believed that software bears the biggest share of the responsibility for tackling these challenges; in other words, this responsibility is delegated to next-generation programming models and their associated middleware/runtimes. This workshop focuses on different aspects of programming models such as task-based parallelism (Charm++, OCR, Habanero, Legion, X10, HPX, etc.), PGAS (OpenSHMEM, UPC, CAF, Chapel, UPC++, etc.), Big Data (Hadoop, Spark, Dask, etc.), Machine Learning (NVIDIA RAPIDS, Scikit-learn, etc.), Deep Learning (Caffe, Microsoft CNTK, Google TensorFlow, Facebook PyTorch), directive-based languages (OpenMP, OpenACC), and hybrid MPI+X. It also covers their associated middleware (unified runtimes, interoperability for hybrid programming, tight integration of MPI+X, and support for accelerators/FPGAs) for next-generation systems and architectures.
The ultimate objective of the ESPM2 workshop is to serve as a forum that brings together researchers from academia and industry working on programming models, runtime systems, compilation, and languages, as well as application developers.
As extreme-scale platforms, both for HPC and cloud computing, continue the decade-long shift toward an architectural configuration that greatly favors local, small-memory-footprint computation, it is time to consider spending a small amount of these excellent local compute and memory resources on efforts to increase programmability and broaden the programmer base. This is especially the case for computations that inherently require significant non-local interaction, large memory footprints, or both. This talk will highlight the work of the speaker and others on novel approaches to remote data manipulation, such as Actors and Conveyors, that employ aggregation and asynchrony, in programming languages such as JavaScript and Rust.
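To make the aggregation idea concrete, here is a minimal single-process C++ sketch of the buffering pattern such libraries rely on: rather than issuing each tiny remote update immediately, updates are queued per destination and transmitted in batches. The Aggregator class and its transmit callback are hypothetical illustrations of the pattern, not the Conveyors or Actors API.

// Hedged sketch of update aggregation: batch small remote updates per
// destination to amortize communication cost. Names are hypothetical.
#include <cstdio>
#include <functional>
#include <vector>

struct Update { long index; long value; };

class Aggregator {
  std::vector<std::vector<Update>> buffers_;   // one buffer per destination
  std::size_t threshold_;                      // flush when a buffer fills
  std::function<void(int, const std::vector<Update>&)> transmit_;
public:
  Aggregator(int ndest, std::size_t threshold,
             std::function<void(int, const std::vector<Update>&)> transmit)
      : buffers_(ndest), threshold_(threshold), transmit_(std::move(transmit)) {}

  // Queue one small update; communication happens only when a batch is full.
  void push(int dest, Update u) {
    buffers_[dest].push_back(u);
    if (buffers_[dest].size() >= threshold_) flush(dest);
  }

  void flush(int dest) {
    if (buffers_[dest].empty()) return;
    transmit_(dest, buffers_[dest]);   // one large message instead of many tiny ones
    buffers_[dest].clear();
  }

  void flush_all() { for (int d = 0; d < (int)buffers_.size(); ++d) flush(d); }
};

int main() {
  // Stand-in for a network send: report the batch size per destination.
  Aggregator agg(4, 3, [](int dest, const std::vector<Update>& batch) {
    std::printf("send %zu updates to rank %d\n", batch.size(), dest);
  });
  for (long i = 0; i < 10; ++i) agg.push(static_cast<int>(i % 4), {i, i * i});
  agg.flush_all();   // drain remaining partial buffers
}

In a real distributed setting the transmit callback would hand the batch to the network asynchronously, which is where the asynchrony half of the technique comes in.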
William Carlson is a member of the research staff at the IDA Center for Computing Sciences where, since 1990, his focus has been on applications and system tools for large-scale parallel and distributed computers. He also leads the UPC language effort, a consortium of industry and academic research institutions aiming to produce a unified approach to parallel C programming based on global address space methods. Dr. Carlson graduated from Worcester Polytechnic Institute in 1981 with a BS degree in Electrical Engineering. He then attended Purdue University, receiving the MSEE and Ph.D. degrees in Electrical Engineering in 1983 and 1988, respectively. From 1988 to 1990, Dr. Carlson was an Assistant Professor at the University of Wisconsin-Madison, where his work centered on performance evaluation of advanced computer architectures.
The landscape of high performance computing is shifting towards a collection of multi-GPU nodes, widening the gap between on-node compute and off-node communication capabilities. Consequently, tolerating communication latencies and maximizing utilization of the compute hardware are becoming increasingly important for achieving high performance. Overdecomposition has been successfully adopted on traditional CPU-based systems to achieve computation-communication overlap, significantly reducing the impact of communication on application performance. However, it has been unclear whether overdecomposition can provide the same benefits on modern GPU systems. In this work, we address the challenges in achieving computation-communication overlap with overdecomposition on GPU systems using the Charm++ parallel programming system. By prioritizing communication with CUDA streams in the application and supporting asynchronous progress of GPU operations in the Charm++ runtime system, we obtain improvements in overall performance of up to 50% and 47% with proxy applications Jacobi3D and MiniMD, respectively.
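As a rough illustration of the prioritization mechanism described above, the following CUDA/C++ sketch creates a high-priority stream for communication-related work so that halo packing and its transfer are not serialized behind bulk interior compute. The kernels and sizes are hypothetical stand-ins, not the paper's actual Jacobi3D or MiniMD code.

// Hedged sketch: in CUDA, numerically lower priority values are more urgent,
// so boundary work on the high-priority stream can preempt interior compute.
#include <cuda_runtime.h>

__global__ void interior_kernel(float* u, int n) { /* bulk interior compute */ }
__global__ void pack_halo_kernel(float* u, float* halo, int n) { /* pack boundary */ }

int main() {
  int least, greatest;   // 'greatest' is the highest (most urgent) priority
  cudaDeviceGetStreamPriorityRange(&least, &greatest);

  cudaStream_t compute_stream, comm_stream;
  cudaStreamCreateWithPriority(&compute_stream, cudaStreamNonBlocking, least);
  cudaStreamCreateWithPriority(&comm_stream, cudaStreamNonBlocking, greatest);

  int n = 1 << 20;
  float *u, *halo_d, *halo_h;
  cudaMalloc(&u, n * sizeof(float));
  cudaMalloc(&halo_d, 1024 * sizeof(float));
  cudaMallocHost(&halo_h, 1024 * sizeof(float));   // pinned memory for async copy

  // Boundary packing and its device-to-host transfer go on the high-priority
  // stream, so they are not stuck behind the long-running interior kernel.
  interior_kernel<<<256, 256, 0, compute_stream>>>(u, n);
  pack_halo_kernel<<<4, 256, 0, comm_stream>>>(u, halo_d, 1024);
  cudaMemcpyAsync(halo_h, halo_d, 1024 * sizeof(float),
                  cudaMemcpyDeviceToHost, comm_stream);

  cudaStreamSynchronize(comm_stream);   // halo ready to hand to the runtime
  cudaStreamSynchronize(compute_stream);
  cudaFree(u); cudaFree(halo_d); cudaFreeHost(halo_h);
  return 0;
}

The asynchronous-progress half of the approach would live in the runtime, which polls for completed transfers rather than blocking on them; that part is beyond this sketch.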
Major simultaneous disruptions are currently under way in both hardware and software as we consider the implications for future extreme scale programming models and middleware. In hardware, “extreme heterogeneity” has become critical to sustaining cost and performance improvements after Moore’s Law, but poses significant productivity and portability challenges for developers. In software, the rise of large-scale data science is driven by developers from diverse backgrounds who demand the rapid prototyping and interactive-notebook capabilities of high-productivity languages like Python.
In this talk, we summarize recent results from a new project on Automating Massively Parallel Heterogeneous Computing (AMPHC) for Python programmers. This project has three main pillars: 1) a "distillation" pre-pass to normalize and clean up common Python code to make it suitable for compiler analysis and optimization, 2) the Intrepydd programming system which optimizes selected kernel functions from the output of distillation via ahead-of-time (AOT) compilation, and 3) a Ray-based runtime system (Ray-AMPHC) that schedules tasks created by the Intrepydd compiler across multiple GPUs and multiple nodes in a cluster. Preliminary results from the current AMPHC prototype implementation will be presented for two application domains: the Space-Time Adaptive Processing (STAP) application from the signal processing domain, and Scalable Dataframe Processing kernels from the data analytics domain.
Vivek Sarkar has been the chair of the School of Computer Science and the Stephen Fleming Chair for Telecommunications in the College of Computing at the Georgia Institute of Technology since August 2017. Prior to joining Georgia Tech, Sarkar was a Professor of Computer Science and the E.D. Butcher Chair in Engineering at Rice University. From 2007 to 2017, Sarkar led Rice's Habanero Extreme Scale Software Research Laboratory, which focused on unifying parallelism and concurrency elements of high-end computing, multicore, and embedded software stacks (http://habanero.rice.edu). He also served as Chair of the Department of Computer Science at Rice from 2013 to 2016.
Prior to joining Rice in 2007, Sarkar was Senior Manager of Programming Technologies at IBM Research. His research projects at IBM included the X10 programming language, the Jikes Research Virtual Machine for the Java language, the ASTI optimizer used in IBM’s XL Fortran product compilers, and the PTRAN automatic parallelization system. Sarkar became a member of the IBM Academy of Technology in 1995, and was inducted as an ACM Fellow in 2008. He has been serving as a member of the US Department of Energy’s Advanced Scientific Computing Advisory Committee (ASCAC) since 2009, and on CRA’s Board of Directors since 2015.
Arm technology is becoming increasingly important in HPC. Recently, Fugaku, an Arm-based system, took the number one place on the Top500 list. Raspberry Pis provide an inexpensive platform to become familiar with this architecture. However, Pis can also be useful on their own. Here we describe our efforts to configure and benchmark the use of a Raspberry Pi cluster with the HPX/Phylanx platform (normally intended for use with HPC applications) and document the lessons we learned. First, we highlight the required changes in the configuration of the Pi to gain performance. Second, we explore how limited memory bandwidth restricts the use of all cores in our shared-memory benchmarks. Third, we evaluate whether low network bandwidth affects distributed performance. Fourth, we discuss the power consumption and the resulting trade-off in cost of operation and performance.
With the end of Moore’s Law, as semiconductor fabrication approaches the nanoscale and power limits cap clock rates, the next generation of extreme-scale computing may be enabled by non-von Neumann architectures and alternative programming/runtime models. Analysis demonstrates opportunities of one to two orders of magnitude, even with today’s enabling technologies, in throughput, bandwidth gain, and latency reduction, along with a broadened scope of applications extending to scalable dynamic graph-based problems, from Adaptive Mesh Refinement numerical domains to data analytics and AI symbolic computations. This presentation will explore an emerging class of simple but powerful memory-centric HPC accelerators, their execution models, and aspects of innovative programming interface semantic constructs and supporting runtime mechanisms for enhanced performance, user productivity, and performance portability.
Dr. Thomas Sterling holds the position of Professor of Intelligent Systems Engineering at the Indiana University (IU) School of Informatics, Computing, and Engineering. Since receiving his Ph.D. from MIT in 1984 as a Hertz Fellow, Dr. Sterling has engaged in applied research in fields associated with parallel computing system structures, semantics, and operation in industry, government labs, and academia. Dr. Sterling is best known as the "father of Beowulf" for his pioneering research in commodity/Linux cluster computing. He was awarded the Gordon Bell Prize in 1997 with his collaborators for this work. He was the PI of the HTMT Project sponsored by NSF, DARPA, NSA, and NASA to explore advanced technologies and their implication for high-end system architectures. Other research projects included the DARPA DIVA PIM architecture project with USC-ISI, the Cray Cascade Petaflops architecture project sponsored by the DARPA HPCS Program, and the Gilgamesh high-density computing project at NASA JPL. Dr. Sterling is the co-author of six books and holds six patents. He was the recipient of the 2013 Vanguard Award, and in 2014 he was named a fellow of the American Association for the Advancement of Science.
This talk will evaluate different ways to program modern GPU supercomputers. Unfortunately, all of the methods that exist today have flaws, so I will describe some goals and requirements for better multi-GPU programming models that may be built in the future.
Jeff Hammond is a Research Scientist in the Parallel Computing Lab at Intel Labs. His research interests include one-sided and global-view programming models, load balancing for irregular algorithms, and shared- and distributed-memory tensor contractions. He has a long-standing interest in enabling the simulation of physical phenomena, primarily the behavior of molecules and materials at atomistic resolution, with massively parallel computing.
Prior to joining Intel, Jeff was an Assistant Computational Scientist at the Argonne Leadership Computing Facility and a Fellow of the University of Chicago Computation Institute. He was a Director's Postdoctoral Fellow at Argonne from 2009 to 2011. In 2009, Jeff received his PhD in chemistry from the University of Chicago as a Department of Energy Computational Science Graduate Fellow. He graduated from the University of Washington with degrees in chemistry and mathematics in 2003.
The IEEE Technical Committee on Scalable Computing named Jeff a Young Achiever in Scalable Computing in 2014 for his work on massively parallel scientific applications and runtime systems.
Nectarios Koziris, National Technical University of Athens, Greece (Moderator)
Paul Carpenter, Barcelona Supercomputing Center, Spain
Tarek El-Ghazawi, The George Washington University, USA
Jeff Hammond, Intel, USA
Dimitrios Nikolopoulos, Virginia Tech, USA
Thomas Sterling, Indiana University, USA
The Fortran 2018 standard introduced syntax and semantics that allow a parallel application to recover from failed images (fail-stop processes) during execution. Teams are a key new language feature that facilitates this capability for applications that use collective subroutines: when a team of images is partitioned into one or more sets of new teams, only active images comprise the new teams; failed images are excluded. This paper summarizes the language facilities for handling failed images specified in the Fortran 2018 standard and subsequent interpretations by the US Fortran Programming Language Standards Technical Committee. We propose standardizing some semantics that have been left processor (implementation) dependent to enable the development of portable fault-tolerant parallel Fortran applications. Finally, we present a prototype implementation of a substantial subset of the Fortran 2018 failed images functionality, including semantic changes proposed herein. This prototype comprises OpenCoarrays, with failed-images enhancements constructed using Open MPI ULFM routines, and a GFortran compiler customized to support additional syntax needed to enable fault-tolerant execution of image control statements.
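Since the prototype builds its failed-images support on Open MPI's ULFM routines, the following C++ sketch shows the underlying ULFM recovery idiom: shrinking a communicator to its surviving ranks, much as a new Fortran team excludes failed images. This is a minimal sketch assuming a ULFM-enabled Open MPI build (mpi-ext.h declares the MPIX_ calls); it is not the OpenCoarrays implementation itself.

// Hedged sketch of ULFM fault recovery: detect a process failure during a
// collective, then shrink to a "new team" of surviving processes.
#include <mpi.h>
#include <mpi-ext.h>   // Open MPI ULFM extensions (MPIX_*)
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  MPI_Comm world = MPI_COMM_WORLD;
  // Return errors instead of aborting, so failures can be handled.
  MPI_Comm_set_errhandler(world, MPI_ERRORS_RETURN);

  int rc = MPI_Barrier(world);           // stand-in for any collective
  int eclass = MPI_SUCCESS;
  if (rc != MPI_SUCCESS) MPI_Error_class(rc, &eclass);

  if (eclass == MPIX_ERR_PROC_FAILED) {
    MPIX_Comm_failure_ack(world);        // acknowledge the known failures
    MPI_Comm survivors;
    MPIX_Comm_shrink(world, &survivors); // communicator of live ranks only
    int rank, size;
    MPI_Comm_rank(survivors, &rank);
    MPI_Comm_size(survivors, &size);
    std::printf("rank %d of %d continuing after failure\n", rank, size);
    MPI_Comm_free(&survivors);
  }
  MPI_Finalize();
}

The Fortran-level FORM TEAM over active images plays the role that MPIX_Comm_shrink plays here, with the compiler and OpenCoarrays hiding the MPI machinery from the application.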
As HPC progresses toward exascale, writing applications that are highly efficient, portable, and support programmer productivity is becoming more challenging than ever. The growing scale, diversity, and heterogeneity in compute platforms increases the burden on software to efficiently use available distributed parallel resources. This burden has fallen on developers who, increasingly, are experts in application domains rather than traditional computer scientists and engineers. We propose CASPER (Compiler Abstractions Supporting high Performance on Extreme-scale Resources), a novel domain-specific compiler and runtime framework to enable domain scientists to achieve high performance and scalability on complex HPC systems. CASPER extends domain-specific languages with machine learning to map software tasks to distributed, heterogeneous resources, and provides a runtime framework to support a variety of adaptive optimizations in dynamic environments. This paper presents an initial design and analysis of CASPER for synthetic aperture radar and computational fluid dynamics domains.
We describe TESSE, an emerging general-purpose, open-source software ecosystem that attacks the twin challenges of programmer productivity and portable performance for advanced scientific applications on modern high-performance computers. TESSE builds upon and extends the PaRSEC DAG/dataflow runtime with new Domain Specific Languages (DSLs) and new integration capabilities. Motivating this work is our belief that such a dataflow model, perhaps with applications composed in domain specific languages, can overcome many of the challenges faced by a wide variety of irregular applications that are poorly served by current programming and execution models. Two such applications, from many-body physics and applied mathematics, are briefly explored. This paper focuses on the Template Task Graph (TTG), TESSE's main C++ API, which provides a powerful work/dataflow programming model. Algorithms on spatial trees, block-sparse tensors, and wave fronts are used to illustrate the API and associated concepts, as well as to compare with related approaches.
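To convey the flavor of a templated task graph, the concept that TTG embodies, here is a schematic C++ sketch in which a task template is instantiated per key as data arrives and results flow along edges to successor templates. The types and names are illustrative only, not the actual TTG API.

// Schematic sketch (not the TTG API): a task template fires once per key
// when its input arrives, and forwards results along a dataflow edge.
#include <cstdio>
#include <functional>
#include <string>

template <typename Key, typename Value>
struct TaskTemplate {
  std::string name;
  // Body runs once per delivered (key, value); 'send' feeds the successor.
  std::function<void(const Key&, const Value&,
                     const std::function<void(Key, Value)>&)> body;
  TaskTemplate* successor = nullptr;

  void deliver(const Key& k, const Value& v) {
    auto send = [this](Key k2, Value v2) {
      if (successor) successor->deliver(k2, v2);   // dataflow edge
    };
    body(k, v, send);
  }
};

int main() {
  // Two-stage graph: square each input, then print it.
  TaskTemplate<int, int> square{"square",
      [](const int& k, const int& v, const auto& send) { send(k, v * v); }};
  TaskTemplate<int, int> print{"print",
      [](const int& k, const int& v, const auto&) {
        std::printf("key %d -> %d\n", k, v);
      }};
  square.successor = &print;
  for (int i = 0; i < 4; ++i) square.deliver(i, i + 1);  // inject input data
}

A real runtime such as PaRSEC would discover parallelism by scheduling independent key instantiations concurrently across cores and nodes, whereas this sketch executes them sequentially for clarity.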