After over two decades of relative architectural stability for distributed memory parallel computers, the end of Dennard scaling and the looming end of Moore's "law" is forcing major changes in computing systems. Can the community continue to use programming systems such as MPI for extreme scale systems? Does the growing complexity of compute nodes require new programming approaches? This talk will discuss some of the issues, with emphasis on internode and intranode programming systems and the connections between them.
Dr. William "Bill" Gropp is NCSA Director and Chief Scientist, and the Thomas M. Siebel Chair in the Department of Computer Science. Dr. Gropp recently co-chaired the National Academy's Committee on Future Directions for NSF Advanced Computing Infrastructure to Support U.S. Science. In 2016, the Association for Computing Machinery (ACM) and IEEE Computer Society named Gropp, a professor of computer science at the University of Illinois at Urbana-Champaign the recipient of the 2016 ACM/IEEE Computer Society Ken Kennedy Award for highly influential contributions to the programmability of high-performance parallel and distributed computers.
Since 2008, he has also been Deputy Director for Research for the Institute of Advanced Computing Applications and Technologies at the University of Illinois. His research interests are in parallel computing, software for scientific computing, and numerical methods for partial differential equations. He has played a major role in the development of the MPI message-passing standard. He is co-author of the most widely used implementation of MPI, MPICH, and was involved in the MPI Forum as a chapter author for both MPI-1 and MPI-2. He has written many books and papers on MPI including "Using MPI" and "Using MPI-2." He is also one of the designers of the PETSc parallel numerical library, and has developed efficient and scalable parallel algorithms for the solution of linear and nonlinear equations.
Large-scale parallel applications with complex global data dependencies beyond those of reductions pose significant scalability challenges in an asynchronous runtime system. Internodal challenges include identifying the all-to-all communication of data dependencies among the nodes. Intranodal challenges include gathering together these data dependencies into usable data objects while avoiding data duplication. This paper addresses these challenges within the context of a large-scale, industrial coal boiler simulation using the Uintah asynchronous many-task runtime system on GPU architectures. We show significant reduction in time spent analyzing data dependencies through refinements in our dependency search algorithm. Multiple task graphs are used to eliminate subsequent analysis when task graphs change in predictable and repeatable ways. Using a combined data store and task scheduler redesign reduces data dependency duplication ensuring that problems fit within host and GPU memory. These modifications did not require any changes to application code or sweeping changes to the Uintah runtime system. We report results running on the DOE Titan system on 119K CPU cores and 7.5K GPUs simultaneously. Our solutions can be generalized to other task dependency problems with global dependencies among thousands of nodes which must be processed efficiently at large scale.
The performance of many parallel applications depends on looplevel parallelism. However, manually parallelizing all loops may result in degrading parallel performance, as some of them cannot scale desirably to a large number of threads. In addition, the overheads of manually tuning loop parameters might prevent an application from reaching its maximum parallel performance.We illustrate how machine learning techniques can be applied to address these challenges. In this research, we develop a framework that is able to automatically capture the static and dynamic information of a loop. Moreover, we advocate a novel method by introducing HPX smart executors for determining the execution policy, chunk size, and prefetching distance of an HPX loop to achieve higher possible performance by feeding static information captured during compilation and runtime-based dynamic information to our learning model. Our evaluated execution results show that using these smart executors can speed up the HPX execution process by around 12% − 35% for the Matrix Multiplication, Stream and 2D Stencil benchmarks compared to setting their HPX loop’s execution policy/parameters manually or using HPX auto-parallelization techniques.
The recent trend of rapid increase in the number of cores per chip has resulted in vast amounts of on-node parallelism. These high core counts result in hardware variability that introduces imbalance. Applications are also becoming more complex themselves, resulting in dynamic load imbalance. Load imbalance of any kind can result in loss of performance and decrease in system utilization. In this paper, we propose a new integrated runtime system that adds OpenMP shared-memory parallelism to the Charm++ distributed programming model to improve load balancing on distributed systems. Our proposal utilizes an infrequent periodic assignment of work to cores based on load measurement, in combination with tasks created via OpenMP’s parallel loop construct from each core to handle load imbalance. We demonstrate the benefits of using this integrated runtime system on the LLNL ASC proxy application Lassen, achieving speedups of 50% over runs without any load balancing and 10% over existing distributed-memory-only balancing schemes in Charm++.
With the shift to exascale computer systems, the importance of productive programming models for distributed systems is increasing. Partitioned Global Address Space (PGAS) programming models aim to reduce the complexity of writing distributed-memory parallel programs by introducing global operations on distributed arrays, distributed task parallelism, directed synchronization, and mutual exclusion. However, a key challenge in the application of PGAS programming models is the improvement of compilers and runtime systems. In particular, one open question is how runtime systems meet the requirement of exascale systems, where a large number of asynchronous tasks are executed.
While there are various tasking runtimes such as Qthreads, OCR, and HClib, there is no existing comparative study on PGAS tasking/ threading runtime systems. To explore runtime systems for PGAS programming languages, we have implemented OCR-based and HClib-based Chapel runtimes and evaluated them with an initial focus on tasking and synchronization implementations. The results show that our OCR and HClib-based implementations can improve the performance of PGAS programs compared to the existing Qthreads backend of Chapel.
Silent data corruption (SDC) and fail-stop errors are the most hazardous error types in high-performance computing (HPC) systems. In this study, we present an automatic, efficient and lightweight redundancy mechanism to mitigate both error types. We propose partial task-replication and checkpointing for task-parallel HPC applications to mitigate silent and fail-stop errors. To avoid the prohibitive costs of complete replication, we introduce a lightweight selective replication mechanism. Using a fully automatic and transparent heuristics, we identify and selectively replicate only the reliability-critical tasks based on a risk metric. Our approach detects and corrects around 70% of silent errors with only 5% average performance overhead. Additionally, the performance overhead of the heuristic itself is negligible.
Asynchronous Many Task (AMT) runtimes promise application designers the ability to better utilize novel hardware resources and to take advantages of the idle times that might arise from the discrepancies due to mismatches between software and hardware components. To foresee possible problems between hardware and software components (described as mismatches), designers usually use models to predict and analyze application behaviors. However, current models are ill suited for the AMT crowd because of its dynamic behavior and agility. To this effect, we developed an extended roofline model that aims to provide upper bounds on execution for AMT frameworks. This work focuses on the validation and error characterization of this model using different statistical techniques and a large set of experiments to evaluate and characterize its error and its sources. We found out that in the worst case, the error can grow to an order of magnitude, however there are several techniques to increase the model accuracy given a machine configuration.
The Open Community Runtime specification prescribes the way a task-parallel application has to be written, in order to give the runtime system the ability to automatically migrate work and data, provide fault tolerance, improve portability, etc. These constraints prevent an application from efficiently starting a new process to run another external program. We have designed an extension of the specification which provides exactly this functionality in a way that fits the task-based model. The bulk of our work is devoted to exploring the way the task-parallel application can interact with an external application without having to resort to using files on a physical drive for data exchange. To eliminate the need to make changes to the external application, the data is exposed via a virtual file system using the filesystem-in-userspace architecture.
Artificial intelligence (AI) has been an interesting research topic for many decades but has struggled to enter mainstream use. Deep Learning (DL) is one form of AI that has recently become more practicable and useful because of dramatic increases in the computational power and in the amount of training data available. Research labs are already using Deep Learning to progress scientific investigations in numerous fields. Commercial enterprises are starting to make product development and marketing decisions based on machine learning models. However, there is a worrying skills gap between the hype and the reality of getting business benefit from Deep Learning. To address this, we need to answer some urgent questions. What practical programming techniques (specifically, programming models and middleware options) should we be teaching new recruits into this area? What existing knowledge and experience (from HPC or elsewhere) should existing practitioners be leveraging? Do traditional big-iron supercomputers and HPC software techniques (including MPI or PGAS) have a place in this vibrant new sphere or is all about high-level scripting, complex workflows, and elastic cloud resources?
Mike Houston, Senior Distinguished Engineer, NVIDIA
Prabhat, Data and Analytics Group Lead, NERSC, Lawrence Berkeley National Laboratory
Jeff Squyres, MPI Architect at Cisco Systems, Inc.
Rick Stevens, Associate Laboratory Director, Argonne National Laboratory