The continuous increase in complexity and scale of high-end systems, together with the evolving diversity of processor options, is forcing computational scientists to face system characteristics that can significantly impact the performance and scalability of their applications. HPC users need a system infrastructure that can adapt to their workload needs, rather than having to constantly redesign their applications to adapt to new systems. In this talk, I will discuss current trends in computer architecture and their implications for the development of HPC applications, programming environments, and middleware. I will present the Oracle Cloud Infrastructure (OCI), which provides availability, resiliency, and performance at scale, so HPC users can easily choose the best option for their workloads, and will discuss hybrid on-prem/cloud options, which facilitate workload migration from on-premises systems to the cloud. I will finish the presentation with a discussion of some of the challenges and open research problems that still need to be addressed in this area.
Dr. Luiz DeRose is a Director of Cloud Engineering for HPC at Oracle. Before joining Oracle, he was a Sr. Science Manager at AWS, and a Senior Principal Engineer and the Programming Environments Director at Cray. Dr. DeRose has a Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign. He has more than 25 years of high-performance computing experience and a deep knowledge of programming and middleware environments for HPC. Dr. DeRose holds eight patents and has published more than 50 peer-reviewed articles in scientific journals, conferences, and book chapters, primarily on compilers and tools for high-performance computing.
We present a scalable algorithm that computes the transitive closure of a graph on shared-memory architectures using the OpenMP API in C++. We present two different parallelization strategies and compare their performance on several datasets of varying sizes. We demonstrate the scalability of the best parallel implementation up to 176 threads on a shared-memory architecture by producing a graph with more than 3.82 trillion edges. To the best of our knowledge, this is the first implementation to compute the transitive closure of such a large graph on a shared-memory system. We also discuss optimization strategies for better cache utilization on large datasets, analyze the important issue of load balancing, and discuss in detail how it can be mitigated by choosing the optimal OpenMP scheduling clause; a minimal illustration of this scheduling knob appears below.
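For readers unfamiliar with the pattern, the sketch below shows one way a transitive-closure loop nest can be parallelized with OpenMP and tuned through the scheduling clause. It is a minimal Floyd-Warshall-style illustration under our own assumptions, not the algorithm, data structures, or schedule used in the paper.

    // Illustrative sketch only: dense boolean transitive closure with OpenMP.
    #include <omp.h>
    #include <vector>
    #include <cstddef>

    // reach[i*n + j] == 1 iff vertex j is reachable from vertex i.
    void transitive_closure(std::vector<char>& reach, std::size_t n) {
        for (std::size_t k = 0; k < n; ++k) {
            // The schedule clause controls how rows are assigned to threads;
            // dynamic scheduling with a chunk size helps when per-row work is
            // imbalanced (e.g., many rows skip the inner loop entirely).
            #pragma omp parallel for schedule(dynamic, 64)
            for (std::size_t i = 0; i < n; ++i) {
                if (!reach[i * n + k]) continue;
                for (std::size_t j = 0; j < n; ++j)
                    reach[i * n + j] |= reach[k * n + j];
            }
        }
    }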
Task-based programming models promise improved communication performance for irregular, fine-grained, and load-imbalanced applications. They do so by relaxing some of the messaging semantics of stricter models and exploiting that flexibility at the lower levels of the software stack. For example, while MPI's two-sided communication model guarantees in-order delivery, requires matching sends to receives, and has the user schedule communication, task-based models generally let the runtime system schedule all execution based on dependencies and message deliveries as they happen. These messaging semantics are critical to enabling high performance. In this paper, we build on previous work that added zero-copy semantics to Converse/LRTS. We examine the messaging semantics of Charm++ as they relate to large message buffers, identify shortcomings, and define new communication APIs to address them. Our work enables in-place communication semantics for point-to-point messaging, broadcasts, transmission of read-only variables at program startup, and migration of chares. We showcase the performance of our new communication APIs using benchmarks for Charm++ and Adaptive MPI, showing nearly 90% latency improvement and 2x lower peak memory usage.
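As context, the fragment below illustrates Charm++'s pre-existing zero-copy send API (a nocopy entry-method parameter on the receiver and CkSendBuffer on the sender), which the in-place semantics described above build on. The Receiver chare and recvData method names are hypothetical, and the paper's new APIs are not reproduced here.

    // Illustrative fragment only; requires the generated proxy from a .ci file.
    //
    // In the .ci interface file, the receiving entry method marks the buffer
    // parameter as nocopy:
    //
    //   array [1D] Receiver {
    //     entry Receiver();
    //     entry void recvData(nocopy double data[n], size_t n);
    //   };

    #include "charm++.h"

    void sendInPlace(CProxy_Receiver receivers, double* buf, size_t n,
                     const CkCallback& doneCb) {
      // CkSendBuffer lends buf to the runtime until doneCb fires, letting the
      // payload be transferred without an intermediate copy.
      receivers[0].recvData(CkSendBuffer(buf, doneCb), n);
    }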
Recent additions to the C++ standard and ongoing standardization efforts aim to add data-parallel types to the C++ standard library. These enable the use of vectorization techniques in existing C++ codes without relying on the C++ compiler's ability to auto-vectorize the code. Integrating the existing parallel algorithms with these new data-parallel types opens up a new way of speeding up existing codes with minimal effort. Today, only limited implementation experience exists for data-parallel execution of the standard parallel algorithms. In this paper, we report on experiences and performance analysis results for our implementation of two new data-parallel execution policies usable with HPX's parallel algorithms module: simd and par_simd. We utilize the experimental implementation of data-parallel types provided by recent versions of the GNU GCC and Clang C++ standard libraries. The benchmark results collected from artificial tests and real-world codes presented in this paper are very promising. Compared to sequenced execution, we report speed-ups of more than three orders of magnitude when executing with the newly implemented data-parallel execution policy par_simd and HPX's parallel algorithms. We also report that our implementation is performance portable across different compute architectures (x64 -- Intel and AMD, and Arm) using different vectorization technologies (AVX2, AVX512, NEON64, and NEON128).
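As background, the sketch below uses the experimental std::experimental::simd data-parallel types mentioned above, as shipped with recent GCC and Clang standard libraries, to vectorize a simple loop by hand. It does not show HPX's simd/par_simd execution policies themselves; the saxpy function is our own illustration.

    // Illustrative sketch only: explicit vectorization with std::experimental::simd.
    #include <experimental/simd>
    #include <vector>
    #include <cstddef>

    namespace stdx = std::experimental;

    // y = a*x + y, processed one SIMD register at a time.
    void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
        using simd_t = stdx::native_simd<float>;
        std::size_t i = 0;
        for (; i + simd_t::size() <= x.size(); i += simd_t::size()) {
            simd_t xv(&x[i], stdx::element_aligned);   // vector load
            simd_t yv(&y[i], stdx::element_aligned);
            yv = a * xv + yv;                          // element-wise SIMD ops
            yv.copy_to(&y[i], stdx::element_aligned);  // vector store
        }
        for (; i < x.size(); ++i)                      // scalar remainder
            y[i] = a * x[i] + y[i];
    }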
Taskflow is a general-purpose parallel and heterogeneous task graph programming system that enables in-graph control flow to express end-to-end parallelism. By integrating control-flow decisions into condition tasks, developers can efficiently overlap CPU-GPU dependent tasks both inside and outside control flow, largely enhancing the capability of task graph parallelism. Condition tasks are powerful but also error-prone. For large task graphs, users can easily introduce erroneous control-flow tasks that cannot be correctly scheduled by the Taskflow runtime. To overcome this challenge, this paper introduces a new instrumentation module, Taskflow-San, to assist users in detecting erroneous control-flow tasks in Taskflow graphs.
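For illustration, the sketch below shows a plain Taskflow condition task forming a simple retry loop: the integer returned by the condition selects which successor runs next. It is a minimal example of the control-flow mechanism discussed above, not of Taskflow-San itself.

    // Illustrative sketch only: a Taskflow condition task driving a loop.
    #include <taskflow/taskflow.hpp>

    int main() {
      tf::Executor executor;
      tf::Taskflow taskflow;
      int attempts = 0;

      auto init  = taskflow.emplace([]() { /* set up work */ });
      auto work  = taskflow.emplace([&]() { ++attempts; });
      // A condition task returns the index of the successor to run next:
      // 0 -> loop back to work, 1 -> proceed to done.
      auto check = taskflow.emplace([&]() { return attempts < 3 ? 0 : 1; });
      auto done  = taskflow.emplace([]() { /* finalize */ });

      init.precede(work);
      work.precede(check);
      check.precede(work, done);   // successor 0 = work, successor 1 = done

      executor.run(taskflow).wait();
      return 0;
    }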
Python is rapidly becoming the lingua franca of machine learning and scientific computing. With the broad use of frameworks such as NumPy, SciPy, and TensorFlow, scientific computing and machine learning are seeing a productivity boost on systems without a requisite loss in performance. While high-performance libraries often provide adequate performance within a node, distributed computing is required to scale Python across nodes and make it truly competitive in large-scale high-performance computing. Many frameworks, such as Charm4Py, DaCe, Dask, Legate NumPy, mpi4py, and Ray, scale Python across nodes. However, little is known about these frameworks' relative strengths and weaknesses, leaving practitioners and scientists without enough information to decide which frameworks are suitable for their requirements. In this paper, we seek to narrow this knowledge gap by studying the relative performance of two such frameworks: Charm4Py and mpi4py. We perform a comparative performance analysis of Charm4Py and mpi4py using CPU- and GPU-based microbenchmarks, including TaskBench and other representative mini-apps for scientific computing.
In the era of exascale computing, the traditional MPI+X paradigm is starting to lose its strength in taking advantage of heterogeneous systems. Consequently, research and development on alternative programming models and runtimes has become increasingly popular, encouraging comparison, on competitive grounds, of these emerging parallel programming approaches against the traditional MPI+X paradigm. In this work, a distributed task-based implementation of a stencil numerical simulation is compared with an MPI+X implementation of the same application. More specifically, the Legion task-based parallel programming system is used as an alternative to MPI at the out-of-node level, while the underlying CUDA-implemented kernels are kept at the node level. The comparison is therefore as fair as possible and focused on the distributed aspects of the simulation. Overall, the results show that the task-based approach is on par with the traditional MPI approach in terms of both performance and scalability.
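For context, the sketch below outlines the MPI+X baseline pattern referred to above: a 1D-decomposed stencil that exchanges halo cells with neighboring ranks each step, with the node-level (CUDA) kernel stubbed out as a placeholder. It is our own minimal illustration, not the authors' code or the Legion variant.

    // Illustrative sketch only: MPI halo exchange around a node-level kernel.
    #include <mpi.h>
    #include <vector>

    // Placeholder for the node-level (e.g. CUDA) stencil kernel.
    void apply_stencil(std::vector<double>& u) { /* ... */ }

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      const int n = 1024;                       // local cells per rank
      std::vector<double> u(n + 2, 0.0);        // +2 ghost cells
      int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
      int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

      for (int step = 0; step < 100; ++step) {
        // Exchange boundary cells with both neighbors before each update.
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                     &u[n + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[n], 1, MPI_DOUBLE, right, 1,
                     &u[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        apply_stencil(u);
      }

      MPI_Finalize();
      return 0;
    }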
Ritu Arora, UT San Antonio (Moderator)
Joaquin Chung, Argonne National Laboratory
John Martinis, University of California Santa Barbara
Harsha Nagarajan, Los Alamos National Laboratory
Andrés Paz, Microsoft