Machines with peak performance exceeding one exaflop/s are just around the corner, and promises of sustained exaflop/s machines abound. Are there significant challenges in runtime frameworks and languages that need to be met to harness the power of these machines? We will examine this question and associated issues.
Some architectural trends are becoming clear, while others are only hazily appearing. Individual nodes are getting “fatter” computationally. Accelerators such as GPGPUs and possibly FPGAs are likely parts of the exascale landscape. High-bandwidth memory, non-coherent caches (such as the caches in GPGPUs typically used for constant memory), NVRAMs, and the resulting deeper and more complex memory hierarchies will also have to be dealt with.
There is an argument going around in the community that we have already figured out how to deal with tens of thousands of nodes (100,000 with BG/Q), and that, since the number of nodes is not likely to increase, we (the extreme-scale HPC community) have to focus research almost entirely on within-node issues. I believe this argument is not well founded. I will explain why issues of power/energy/temperature, whole-machine (multi-job) optimizations, and across-node issues like communication optimization, load balancing, and fault tolerance are still worthy of significant attention from the exascale runtime and language community. At the same time, there exist issues in handling within-node parallelism that arise mainly or only in the context of large multi-node runs.
I will also address the question of how our community should approach research if a large segment of funding sources and application community have started thinking some of the above issues are irrelevant. What should the courage of our convictions lead us to?
Laxmikant (Sanjay) Kale is the director of the Parallel Programming Laboratory and the Paul and Cynthia Saylor Professor of Computer Science at the University of Illinois at Urbana-Champaign, where he has worked since 1985. His research spans parallel computing, with a focus on adaptive runtime systems, and is guided by the belief that only interdisciplinary research involving multiple CSE applications can lead to well-honed abstractions with potential for long-term impact. His collaborations include the biomolecular simulation program NAMD (Gordon Bell award, 2002) and applications in computational cosmology, quantum chemistry, etc. He takes pride in his group's success in distributing software embodying their research ideas, including Charm++ and AMPI.
His degrees include B.Tech, Electronics Engineering from BHU, India (1977), M.E., IISc Bangalore, India (1979) and Ph.D., Computer Science, SUNY, Stony Brook (1985). Prof. Kale is a fellow of the ACM and IEEE, and a winner of the 2012 IEEE Sidney Fernbach award.
Futures are a widely used abstraction for enabling deferred execution in imperative programs. Deferred execution enqueues tasks rather than explicitly blocking and waiting for them to execute. Many task-based programming models with some form of deferred execution rely on explicit parallelism that is the responsibility of the programmer. Deterministic-by-default (implicitly parallel) models instead use data effects to derive concurrency automatically, alleviating the burden of concurrency management. Both implicitly and explicitly parallel models are particularly challenging for imperative object-oriented programming. Fine-granularity parallelism across member functions or amongst data members may exist, but is often ignored. In this work, we define a general permissions model that leverages the C++ type system and move semantics to define an asynchronous programming model embedded in the C++ type system. Although a default distributed memory semantic is provided, the concurrent semantics are entirely configurable through C++ constexpr integers. Correct use of the defined semantic is verified at compile time, allowing deterministic-by-default concurrency to be safely added to applications. Here we demonstrate the use of these “extended futures” for distributed memory asynchronous communication and load balancing. An MPI particle-in-cell application is modified with the wrapper class using this task model, with results presented for a Haswell system up to 64 nodes.
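As a rough illustration of the two ingredients named above, deferred execution through a future and ownership transfer through move semantics, the following minimal sketch uses only the standard library. It is not the permissions model or the “extended futures” wrapper described in the abstract; the names `sum_owned` and `deferred_sum` are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <future>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical task body: sums a buffer that it owns outright.
// Taking the vector by value forces the caller to move (or copy) it in,
// so the task never aliases data the caller is still mutating.
long sum_owned(std::vector<int> data) {
    return std::accumulate(data.begin(), data.end(), 0L);
}

// Enqueue the work and return a future immediately instead of blocking.
// With std::launch::deferred the task runs lazily on the first get()/wait();
// passing std::launch::async instead would run it on another thread.
std::future<long> deferred_sum(std::vector<int> data,
                               std::launch policy = std::launch::deferred) {
    return std::async(policy, sum_owned, std::move(data));
}
```

A caller would write `auto f = deferred_sum(std::move(values));`, continue with unrelated work, and call `f.get()` only when the sum is needed. After the move, the caller no longer holds the data, which is the kind of access restriction a compile-time permissions model could enforce more generally.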
As the complexity of both algorithms and architectures increases, developing scientific software becomes a challenge. To exploit future architectures, we consider a multi-SPMD workflow programming model. In such a model, data transfer between tasks during computation depends heavily on the architecture and middleware used. In this study, we design an adaptive system for data management in a parallel programming environment that can express two levels of parallelism. We show how considering multiple strategies based on I/O and direct message passing can improve performance and fault tolerance in the YML-XMP environment. On a real application with a sufficiently large amount of local data, we obtain speedups ranging from 1.36 for a mixed strategy to 1.73 for a direct message passing method, compared to our original design.
Experience shows that on today's high-performance systems it is difficult to exploit different acceleration cards while also keeping all other parts of the system highly utilized. Future architectures, like exascale clusters, are expected to aggravate this issue as the number of cores is expected to increase and memory hierarchies are expected to become deeper. A major concern for distributed applications is to guarantee high utilization of all available resources, including local or remote acceleration cards on a cluster, while fully using all the available CPU resources and integrating the GPU work into the overall programming model.
To integrate CUDA code, we extended HPX to enable asynchronous data transfers to and from the GPU device and the asynchronous invocation of CUDA kernels on this data. Both operations are well integrated into the general programming model of HPX, which allows any GPU operation to be seamlessly overlapped with work on the main cores. Any user-defined CUDA kernel can be launched.
We present asynchronous implementations of the data transfers and kernel launches for CUDA code as part of an HPX asynchronous execution graph. Using this approach, we can combine all remotely and locally available acceleration cards on a cluster to utilize its full performance capabilities. Overhead measurements show that integrating the asynchronous operations into the HPX execution graph imposes no additional computational overhead and significantly eases orchestrating coordinated and concurrent work on the main cores and the GPU devices in use.
In this paper we describe the basic idea, implementation, and achieved performance of Formura, our DSL for stencil computation, on systems based on the PEZY-SC2 many-core processor. From a high-level description of the differential equation and a simple description of the finite-difference stencil, Formura generates the entire simulation code, including MPI parallelization with overlapped communication and calculation, advanced temporal blocking, and parallelization for many-core processors. The achieved performance is 4.78 PF, or 21.5% of the theoretical peak performance, for an explicit scheme for compressive CFD with fourth-order accuracy in space and third-order accuracy in time. For a slightly modified implementation of the same scheme, efficiency was slightly lower (17.5%), but the actual calculation time per timestep was faster by 25%. Temporal blocking improved the performance by up to 70%. Even though the B/F number of PEZY-SC2 is low, around 0.02, we have achieved efficiency comparable to that of highly optimized CFD codes on machines with much higher memory bandwidth, such as the K computer. We have demonstrated that automatic generation of code with temporal blocking is a quite effective way to make use of very large-scale machines with low memory bandwidth for large-scale CFD calculations.
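To make the idea of temporal blocking concrete, the sketch below applies it to a toy 1D three-point averaging stencil with fixed end cells. It uses overlapped tiling with redundant halo computation, which is only one of several temporal blocking schemes and is an assumption here, not necessarily the scheme Formura generates. The stencil, the `Grid` alias, and all function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Grid = std::vector<double>;

// Reference: `steps` full sweeps over the array; the two end cells stay fixed.
Grid naive_sweeps(Grid u, std::size_t steps) {
    Grid next = u;
    for (std::size_t t = 0; t < steps; ++t) {
        for (std::size_t i = 1; i + 1 < u.size(); ++i)
            next[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0;
        u.swap(next);
    }
    return u;
}

// Temporal blocking: each tile is loaded once together with a halo of `steps`
// cells per side, advanced `steps` time steps in a small local buffer, and
// only the tile's own cells are written back. Halo cells are recomputed
// redundantly; stale values creep inward one cell per step and therefore
// never reach the tile's own cells.
Grid blocked_sweeps(const Grid& u0, std::size_t steps, std::size_t tile) {
    assert(tile > 0);
    const std::size_t n = u0.size();
    Grid out = u0;
    for (std::size_t begin = 0; begin < n; begin += tile) {
        const std::size_t end = std::min(begin + tile, n);
        const std::size_t lo = begin >= steps ? begin - steps : 0;  // clamp at domain edge
        const std::size_t hi = std::min(end + steps, n);
        Grid cur(u0.data() + lo, u0.data() + hi);
        Grid nxt = cur;
        for (std::size_t t = 0; t < steps; ++t) {
            for (std::size_t i = 1; i + 1 < cur.size(); ++i)
                nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0;
            cur.swap(nxt);
        }
        std::copy(cur.data() + (begin - lo), cur.data() + (end - lo),
                  out.data() + begin);
    }
    return out;
}

// Largest absolute difference between two grids of equal size.
double max_abs_diff(const Grid& a, const Grid& b) {
    assert(a.size() == b.size());
    double worst = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    return worst;
}
```

Comparing `blocked_sweeps(u, steps, tile)` against `naive_sweeps(u, steps)` with `max_abs_diff` checks that the two agree. The benefit on a low-B/F machine comes from each tile being read from main memory once per `steps` time steps rather than once per step, at the cost of recomputing the halo cells.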
Despite advancements in the areas of parallel and distributed computing, the complexity of programming on High Performance Computing (HPC) resources has deterred many domain experts, especially in the areas of machine learning and artificial intelligence (AI), from utilizing performance benefits of such systems. Researchers and scientists favor high-productivity languages to avoid the inconvenience of programming in low-level languages and the costs of acquiring the skills required for programming at this level. In recent years, Python, with the support of linear algebra libraries like NumPy, has gained popularity despite facing limitations which prevent this code from distributed runs. Here we present a solution which maintains both high-level programming abstractions and parallel and distributed efficiency. Phylanx is an asynchronous array processing toolkit which transforms Python and NumPy operations into code which can be executed in parallel on HPC resources by mapping Python and NumPy functions and variables into a dependency tree executed by HPX, a general purpose, parallel, task-based runtime system written in C++. Phylanx additionally provides introspection and visualization capabilities for debugging and performance analysis. We have tested the foundations of our approach by comparing our implementation of widely used machine learning algorithms to accepted NumPy standards.
A well-recognized characteristic of extreme scale systems is that their computation bandwidths far exceed their communication bandwidths. PGAS runtimes have proven to be effective in enabling efficient use of communication bandwidth, due to their efficient support for short nonblocking one-sided messages. However, they were not designed for exploiting the massive levels of intra-node parallelism found in extreme scale systems.
Advanced programming models, domain specific languages, and scripting toolkits have the potential to greatly accelerate the adoption of high performance computing. These complex software systems, however, are often difficult to install and maintain, especially on exotic high-end systems. We consider deep learning workflows used on petascale systems and redeployment on research clusters using containers. Containers are used to deploy the MPI-based infrastructure, but challenges in efficiency, usability, and complexity must be overcome. In this work, we address these challenges through enhancements to a unified workflow system that manages interaction with the container abstraction, the cluster scheduler, and the programming tools. We also report results from running the application on our system, harnessing 298 TFLOPS (single precision).
Dynamic task-based parallelism has become a widely-accepted paradigm in the quest for exascale computing. In this work, we deliver a non-trivial demonstration of the advantages of explicit over implicit tasking in OpenMP 4.5 in terms of both expressiveness and performance. We target the Kripke benchmark, a mini-application used to test the performance of discrete particle codes, and find that the dependence structure of the core “sweep” kernel is well-suited for dynamic task-based systems. Our results show that explicit tasking delivers a 31.7% and 8.1% speedup over a pure implicit implementation for a small and large problem, respectively, while a hybrid variant also underperforms the explicit variant by 13.1% and 5.8%, respectively.
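The kind of dependence structure that suits explicit tasking can be sketched with a toy 2D wavefront sweep in which every cell needs its west and north neighbours first. This is not Kripke's sweep kernel and not the implementation evaluated in the abstract; the recurrence and the names `sweep_tasks` and `sweep_serial` are hypothetical. The code uses OpenMP 4.5 `task` constructs with `depend` clauses. If compiled without OpenMP support, the pragmas are ignored and the loop simply runs serially with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial reference. The grid is padded with one inflow row and column of
// zeros, so interior cell (i, j) with i, j >= 1 lives at index i * (ny + 1) + j.
std::vector<double> sweep_serial(std::size_t nx, std::size_t ny, double source) {
    const std::size_t w = ny + 1;
    std::vector<double> grid((nx + 1) * w, 0.0);
    for (std::size_t i = 1; i <= nx; ++i)
        for (std::size_t j = 1; j <= ny; ++j)
            grid[i * w + j] = source + 0.5 * (grid[(i - 1) * w + j] + grid[i * w + j - 1]);
    return grid;
}

// Explicit tasking: one task per cell, ordered only by its data dependences.
// Cells on the same anti-diagonal have no mutual dependence and may run
// concurrently. Real codes would use much coarser tasks (blocks of cells).
std::vector<double> sweep_tasks(std::size_t nx, std::size_t ny, double source) {
    const std::size_t w = ny + 1;
    std::vector<double> grid((nx + 1) * w, 0.0);
    double* g = grid.data();  // raw pointer so depend clauses name plain array elements
    #pragma omp parallel
    #pragma omp single
    {
        for (std::size_t i = 1; i <= nx; ++i) {
            for (std::size_t j = 1; j <= ny; ++j) {
                const std::size_t self = i * w + j;
                #pragma omp task depend(in: g[self - w], g[self - 1]) depend(out: g[self])
                g[self] = source + 0.5 * (g[self - w] + g[self - 1]);
            }
        }
    }  // the barrier at the end of the region waits for all tasks to finish
    return grid;
}
```

The single thread that generates the tasks never encodes the wavefront order itself; the runtime derives it from the `depend` clauses, which is what makes this formulation more expressive than restructuring the loops by hand.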
For many years, HPC systems have consisted of predominantly homogeneous arrangements of cheap commodity components. More recently, power limitations are leading to stricter energy efficiency demands, and new applications and use cases, such as AI and data analytics, are driving innovations in hardware targeted for HPC systems and large datacenters. Accelerators like GPGPUs take some of the traditional CPU workload but have different design goals and capabilities. High bandwidth memory and non-volatile memory change the expectations and fundamental concepts that underpin traditional programming languages. Compute-in-network, network attached memory/storage, and FPGAs combine functionalities that were traditionally separate, which our traditional programming models struggle to represent well. The driving forces behind these hardware innovations and trends seem set to continue as HPC pushes towards Exascale and beyond. What is being done to help HPC and AI programmers navigate this exciting and changing landscape?
Richard Graham, Senior Solutions Architect, Mellanox Technologies
Adrian Jackson, Research Architect, EPCC, The University of Edinburgh, UK
Chris J. Newburn (CJ), Principal HPC Architect, NVIDIA
Ashish Sirasao, Fellow Engineer, Software and IP team, Xilinx Inc