Organizing Committee

Program Chairs

Program Committee

Tentative Program

All Times Are U.S. CST

9:00 - 9:05

Opening Remarks

Hari Subramoni, Aamir Shafi, Karl Schulz, and Dhabaleswar K (DK) Panda
The Ohio State University and UT Austin


The continuous increase in the complexity and scale of high-end systems, together with the evolving diversity of processor options, is forcing computational scientists to face system characteristics that can significantly impact the performance and scalability of applications. HPC users need a system infrastructure that can adapt to their workload needs, rather than having to constantly redesign their applications to adapt to new systems. In this talk, I will discuss the current trends in computer architecture and their implications for the development of HPC applications, programming models, and middleware environments. I will present the Oracle Cloud Infrastructure (OCI), which provides availability, resiliency, and performance at scale, so HPC users can easily choose the best option for their workloads, and will discuss hybrid on-prem/cloud options, which facilitate workload migration from on-premises systems to the cloud. I will finish the presentation with a discussion of some of the challenges and open research problems that still need to be addressed in this area.


Luiz DeRose

Dr. Luiz DeRose is a Director of Cloud Engineering for HPC at Oracle. Before joining Oracle, he was a Sr. Science Manager at AWS, and a Senior Principal Engineer and the Programming Environments Director at Cray. Dr. DeRose has a Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign. He has more than 25 years of high-performance computing experience and a deep knowledge of programming and middleware environments for HPC. Dr. DeRose has eight patents and has published more than 50 peer-reviewed articles in scientific journals, conferences, and book chapters, primarily on the topics of compilers and tools for high performance computing.

10:00 - 10:30

Morning Coffee Break

Scalable parallel algorithm for fast computation of Transitive Closure on Shared Memory Architectures

Bhaskar Chaudhury1, Bhrugu Dave2, Mihir Desai2, Sidharth Kumar3, Smit Kumbhani2, Sarthak Patel1
Dhirubhai Ambani Institute of Information and Communication Technology1, Group in Computational Science and HPC, DA-IICT, India2, University of Alabama at Birmingham3

We present a scalable algorithm that computes the transitive closure of a graph on shared memory architectures using the OpenMP API in C++. Two different parallelization strategies are presented, and the performance of the two algorithms is compared for several data-sets of varying sizes. We demonstrate the scalability of the best parallel implementation up to 176 threads on a shared memory architecture, by producing a graph with more than 3.82 trillion edges. To the best of our knowledge, this is the first implementation that has computed the transitive closure of such a large graph on a shared memory system. Optimization strategies for better cache utilization for large data-sets are discussed. The important issue of load balancing is analyzed, and its mitigation using the optimal OpenMP scheduling clause is discussed in detail.
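The paper's specific parallelization strategies are not reproduced here; as a point of reference, a minimal OpenMP sketch of one common shared-memory approach (parallelizing the row loop of a Floyd-Warshall-style boolean closure) might look like the following. The function name and layout are illustrative, not the authors' code.

```cpp
#include <vector>
#include <cstddef>

// Boolean transitive closure via a Floyd-Warshall-style sweep over a
// dense adjacency matrix. The k-loop carries a dependency, but for a
// fixed k the i-iterations are independent, so the i-loop can be
// parallelized with OpenMP. (Illustrative sketch only; without
// -fopenmp the pragma is ignored and the loop runs serially.)
std::vector<std::vector<char>>
transitive_closure(std::vector<std::vector<char>> reach) {
    const std::size_t n = reach.size();
    for (std::size_t k = 0; k < n; ++k) {
        #pragma omp parallel for schedule(static)
        for (std::size_t i = 0; i < n; ++i) {
            if (!reach[i][k]) continue;   // row i gains nothing via k
            for (std::size_t j = 0; j < n; ++j)
                reach[i][j] |= reach[k][j];
        }
    }
    return reach;
}
```

The `schedule` clause is where the load-balancing trade-off the abstract mentions shows up: rows that fail the `reach[i][k]` test finish early, so `schedule(dynamic)` or `guided` can outperform `static` on skewed graphs.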

Accelerating Messages by Avoiding Copies in an Asynchronous Task-based Programming Model

Nitin Bhat1, Laxmikant Kale2, Evan Ramos1, Sam White2
Charmworks Inc1, University of Illinois, Charmworks Inc2

Task-based programming models promise improved communication performance for irregular, fine-grained, and load-imbalanced applications. They do so by relaxing some of the messaging semantics of stricter models and exploiting that flexibility at lower levels of the software stack. For example, while MPI's two-sided communication model guarantees in-order delivery, requires matching sends to receives, and has the user schedule communication, task-based models generally let the runtime system schedule all execution based on dependencies and message deliveries as they happen. These messaging semantics are critical to enabling high performance. In this paper, we build on previous work that added zero-copy semantics to Converse/LRTS. We examine the messaging semantics of Charm++ as they relate to large message buffers, identify shortcomings, and define new communication APIs to address them. Our work enables in-place communication semantics for point-to-point messaging, broadcasts, transmission of read-only variables at program startup, and migration of chares. We showcase the performance of our new communication APIs using benchmarks for Charm++ and Adaptive MPI, which show nearly 90% lower latency and 2x lower peak memory usage.
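Charm++'s actual zero-copy APIs are not shown here, but the core idea of in-place semantics (the runtime takes ownership of the user's buffer instead of copying it into a message) can be sketched in plain C++. All names below are hypothetical; this is not the Charm++ interface.

```cpp
#include <vector>
#include <utility>

// Toy message queue contrasting copy-based and in-place ("zero copy")
// send semantics. Hypothetical types; not the Charm++ API.
struct Message { std::vector<double> payload; };

struct Queue {
    std::vector<Message> pending;

    // Copy semantics: the payload is duplicated into the message, so
    // the caller may immediately reuse its buffer.
    void send_copy(const std::vector<double>& buf) {
        pending.push_back(Message{buf});              // O(n) copy
    }

    // In-place semantics: ownership of the buffer moves into the
    // message; no bytes are copied, but the caller must not touch the
    // buffer again until the runtime signals completion.
    void send_inplace(std::vector<double>&& buf) {
        pending.push_back(Message{std::move(buf)});   // O(1) pointer steal
    }
};
```

The latency and peak-memory gains reported in the abstract come from eliminating exactly the O(n) copy and the second live buffer that `send_copy` implies for large messages.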

Parallel SIMD - A Policy-Based Solution for Free Speed-Up using C++ Data-Parallel Types

Nikunj Gupta1, Hartmut Kaiser2, Auriane Reverdell3, S Srinivas Yadav4
University of Illinois at Urbana-Champaign1, Louisiana State University2, Swiss National Supercomputing Centre3, Keshav Memorial Institute of Technology; Louisiana State University, Center for Computation and Technology4

Recent additions to the C++ standard and ongoing standardization efforts aim to add data-parallel types to the C++ standard library. This enables the use of vectorization techniques in existing C++ codes without having to rely on the C++ compiler's abilities to auto-vectorize the code's execution. The integration of the existing parallel algorithms with these new data-parallel types opens up a new way of speeding up existing codes with minimal effort. Today, only very little implementation experience exists for potential data-parallel execution of the standard parallel algorithms. In this paper, we report on experiences and performance analysis results for our implementation of two new data-parallel execution policies usable with HPX's parallel algorithms module: simd and par_simd. We utilize the new experimental implementation of data-parallel types provided by recent versions of the GNU GCC and Clang C++ standard libraries. The benchmark results collected from artificial tests and real-world codes presented in this paper are very promising. Compared to sequenced execution, we report on speed-ups of more than three orders of magnitude when executed using the newly implemented data-parallel execution policy par_simd with HPX's parallel algorithms. We also report that our implementation is performance portable across different compute architectures (x64 -- Intel and AMD, and Arm), using different vectorization technologies (AVX2, AVX512, NEON64, and NEON128).
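HPX's `simd` and `par_simd` policies themselves are not reproduced here; the overload-resolution mechanism behind such policy-based designs (the policy tag at the call site selects the implementation at compile time) can be sketched in standalone C++. The policy names and `scale` function below are assumptions for illustration, not HPX's API.

```cpp
#include <vector>
#include <cstddef>

// Tag types standing in for execution policies, in the spirit of
// HPX's simd/par_simd policies (hypothetical names, not HPX's).
struct seq_policy  {};
struct simd_policy {};

// Sequential scaling.
void scale(seq_policy, std::vector<float>& v, float a) {
    for (std::size_t i = 0; i < v.size(); ++i) v[i] *= a;
}

// Vectorized scaling: the pragma asks the compiler to emit SIMD
// instructions for the loop. If OpenMP SIMD is unavailable the pragma
// is ignored and the loop runs scalar but stays correct.
void scale(simd_policy, std::vector<float>& v, float a) {
    float* p = v.data();
    const std::size_t n = v.size();
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) p[i] *= a;
}
```

Because the policy is resolved statically, switching an algorithm from sequential to data-parallel execution is a one-token change at the call site, which is the "free speed-up" the title refers to.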

Taskflow-San: Sanitizing Erroneous Control Flow in Taskflow Programs

Tsung-Wei Huang1, Mckay Mower1, Lukas Majors1
University of Utah1

Taskflow is a general-purpose parallel and heterogeneous task graph programming system that enables in-graph control flow to express end-to-end parallelism. By integrating control-flow decisions into condition tasks, developers can efficiently overlap CPU-GPU dependent tasks both inside and outside control flow, largely enhancing the capability of task graph parallelism. Condition tasks are powerful but also prone to mistakes. In large task graphs, users can easily introduce erroneous control-flow tasks that cannot be correctly scheduled by the Taskflow runtime. To overcome this challenge, this paper introduces a new instrumentation module, Taskflow-San, to help users detect erroneous control-flow tasks in Taskflow graphs.
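Taskflow's own API is not shown here; the essence of a condition task, a task whose integer return value selects which successor runs next, can be sketched with a tiny hand-rolled scheduler. The `Task`/`run` names are hypothetical, not Taskflow's.

```cpp
#include <functional>
#include <vector>
#include <cstddef>

// Minimal in-graph control flow: each task returns an index into its
// successor list, or a negative value to stop. Illustrative sketch,
// not the Taskflow runtime.
struct Task {
    std::function<int()> work;       // returns which successor to run
    std::vector<std::size_t> next;   // candidate successors
};

// Follow the branch each condition task chooses until a task opts to
// stop or returns an index with no matching successor.
inline void run(const std::vector<Task>& graph, std::size_t start) {
    std::size_t cur = start;
    for (;;) {
        int choice = graph[cur].work();
        if (choice < 0 ||
            static_cast<std::size_t>(choice) >= graph[cur].next.size())
            return;   // out-of-range choices are exactly the kind of
                      // erroneous control flow a sanitizer would flag
        cur = graph[cur].next[static_cast<std::size_t>(choice)];
    }
}
```

A condition task returning an index outside its successor list, or forming an unintended infinite loop, is the class of bug that a tool like Taskflow-San aims to surface before the runtime mis-schedules the graph.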

12:30 - 2:00

Lunch Break

Performance Evaluation of Python Parallel Programming Models: Charm4Py and mpi4py

Zane Fink1
University of Illinois at Urbana-Champaign1

Python is rapidly becoming the lingua franca of machine learning and scientific computing. With the broad use of frameworks such as NumPy, SciPy, and TensorFlow, scientific computing and machine learning are seeing a productivity boost on systems without a requisite loss in performance. While high-performance libraries often provide adequate performance within a node, distributed computing is required to scale Python across nodes and make it truly competitive in large-scale high-performance computing. Many frameworks, such as Charm4Py, DaCe, Dask, Legate Numpy, mpi4py, and Ray, scale Python across nodes. However, little is known about these frameworks' relative strengths and weaknesses, leaving practitioners and scientists without enough information about which frameworks are suitable for their requirements. In this paper, we seek to narrow this knowledge gap by studying the relative performance of two such frameworks: Charm4Py and mpi4py. We perform a comparative performance analysis of Charm4Py and mpi4py using CPU- and GPU-based microbenchmarks, including TaskBench and other representative mini-apps for scientific computing.

Evaluation of Distributed Tasks in Stencil-based Application on GPUs

Jonathon M Anderson1, Mauricio Araya2, Jie Meng2, Eric K Raut3
Rice University1, Total E&P Research and Technology US LLC2, Stony Brook University3

In the era of exascale computing, the traditional MPI+X paradigm is losing its strength in taking advantage of heterogeneous systems. Consequently, research and development on alternative programming models and runtimes have become increasingly popular. This encourages comparing these emerging parallel programming approaches against the traditional MPI+X paradigm on competitive grounds. In this work, a distributed task-based implementation of a stencil numerical simulation is compared with an MPI+X implementation of the same application. Specifically, the Legion task-based parallel programming system is used as an alternative to MPI at the inter-node level, while the underlying CUDA-implemented kernels are kept at the node level. The comparison is therefore as fair as possible and focused on the distributed aspects of the simulation. Overall, the results show that the task-based approach is on par with the traditional MPI approach in terms of both performance and scalability.
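The paper's kernels are not reproduced here; for orientation, the stencil pattern being distributed can be illustrated with a minimal single-node 1-D sweep. In the distributed codes the halo cells would be filled by MPI messages or by Legion region dependencies; that exchange, not this inner update, is what differs between the two approaches. The function below is an assumption-laden sketch, not the paper's kernel.

```cpp
#include <vector>
#include <cstddef>

// One Jacobi-style sweep of a 3-point 1-D stencil with a one-cell
// halo at each end. Here the halo values are simply held fixed;
// in a distributed run they would come from neighboring ranks
// (MPI) or neighboring logical regions (Legion).
std::vector<double> stencil_sweep(const std::vector<double>& u) {
    std::vector<double> out = u;                    // keep halo values
    for (std::size_t i = 1; i + 1 < u.size(); ++i)
        out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0;
    return out;
}
```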

3:00 - 3:30

Afternoon Coffee Break

Panel Members

Ritu Arora, UT San Antonio (Moderator)

Joaquin Chung, Argonne National Laboratory

John Martinis, University of California Santa Barbara

Harsha Nagarajan, Los Alamos National Laboratory

Andrés Paz, Microsoft

4:55 - 5:00

Closing Remarks

Hari Subramoni, Aamir Shafi, Karl Schulz, and Dhabaleswar K (DK) Panda
The Ohio State University and UT Austin