Extreme-scale computing in HPC, Big Data, Deep Learning, and Clouds is marked by multiple levels of hierarchy and heterogeneity, ranging from the compute units (many-core CPUs, GPUs, APUs, etc.) to storage devices (NVMe, NVMe over Fabrics, etc.) to the network interconnects (InfiniBand, High-Speed Ethernet, Omni-Path, etc.). Owing to the plethora of heterogeneous communication paths with different cost models expected to be present in extreme-scale systems, data movement lies at the heart of many of the challenges for exascale computing. At the same time, advances in networking technologies such as node-level interconnects (like NVLink), RDMA-enabled networks, and the like are constantly pushing the envelope of research in novel communication and computing architectures for extreme-scale computing. The goal of this workshop is to bring together researchers and software/hardware designers from academia, industry, and national laboratories who are involved in creating network-based computing solutions for extreme-scale architectures. The objectives of this workshop are to share the experiences of the members of this community and to learn about the opportunities and challenges in the design trends for exascale communication architectures.

ExaComm welcomes original submissions in a range of areas, including but not limited to:

  • Scalable communication protocols
  • High performance networks
  • Runtime/middleware designs
  • Impact of high performance networks on Deep Learning / Machine Learning
  • Impact of high performance networks on Big Data
  • Novel hardware/software co-design
  • High performance communication solutions for accelerator based computing
  • Power-aware techniques and designs
  • Performance evaluations
  • Quality of Service (QoS)
  • Resource virtualization and SR-IOV

8:50 - 9:00

Opening Remarks

Hari Subramoni and Dhabaleswar K (DK) Panda
The Ohio State University

Abstract

As we enter the exascale era, workloads are becoming increasingly heterogeneous, with AI and data analytics supporting traditional simulation and modeling, and with ever-increasing volumes of data being generated and consumed. At the same time, CMOS performance is plateauing, with increased parallelism left as the only route to significant performance gains. This puts an increased burden on the system interconnect to scale, to deliver high sustained throughput, and to provide isolation between a variety of concurrent workloads. This talk will describe Slingshot, Cray’s next-generation system interconnect. Slingshot marries Ethernet, for standards-based interoperability, with state-of-the-art HPC features. It provides high sustained throughput via 56G PAM-4 signaling, a very low-diameter topology, and advanced adaptive routing. More importantly, it provides performance isolation between workloads, and low, uniform latency, via highly flexible QoS mechanisms and a unique and highly effective congestion control mechanism.


Bio

Steve Scott

Steve Scott serves as Senior Vice President and Chief Technology Officer, responsible for guiding the long-term technical direction of Cray’s supercomputing, storage and analytics products. Dr. Scott rejoined Cray in 2014 after serving as principal engineer in the Platforms group at Google and before that as the senior vice president and chief technology officer for NVIDIA’s Tesla business unit. Dr. Scott first joined Cray in 1992, after earning his Ph.D. in computer architecture and BSEE in computer engineering from the University of Wisconsin-Madison. He was the chief architect of several Cray supercomputers and interconnects. Dr. Scott is a noted expert in high performance computer architecture and interconnection networks. He holds 41 U.S. patents in the areas of interconnection networks, cache coherence, synchronization mechanisms and scalable parallel architectures. He received the 2005 ACM Maurice Wilkes Award and the 2005 IEEE Seymour Cray Computer Engineering Award, and is a Fellow of IEEE and ACM. Dr. Scott was named to HPCwire’s “People to Watch in High Performance Computing” in 2012 and 2005.

Abstract

In-Network Computing transforms the data center interconnect into a “Distributed CPU” and “Distributed Memory”. It makes it possible to overcome performance barriers and enables faster and more scalable data analysis. HDR 200G InfiniBand In-Network Computing technology includes several elements: the Scalable Hierarchical Aggregation and Reduction Protocol (SHARP), Smart Tag Matching and Rendezvous Protocol, and more. These technologies are in use in some of the recent large-scale supercomputers around the world, including top TOP500 platforms. The session will discuss the InfiniBand In-Network Computing technology and performance results, as well as offer a glance into the future roadmap.
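To make the offload target concrete, the sketch below shows a plain MPI_Allreduce in C, the kind of collective reduction that SHARP can execute inside the switch fabric rather than on the hosts. The program uses only standard MPI calls; whether the reduction is actually offloaded depends on the MPI library and fabric configuration, which is an assumption beyond what the abstract states.

```c
/*
 * Minimal MPI_Allreduce sketch: the collective reduction that in-network
 * computing (e.g., SHARP) can offload to the switches instead of performing
 * it in host software.  Only standard MPI is used at the application level;
 * offload enablement is a property of the MPI library and fabric setup.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    double local_sum, global_sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    local_sum = (double)rank;                    /* each rank contributes one value */
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);               /* reduction may run in-network    */

    if (rank == 0)
        printf("sum over %d ranks = %.1f\n", size, global_sum);

    MPI_Finalize();
    return 0;
}
```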


Bio

Dror Goldenberg

Dror joined Mellanox as an architect in 2000 to work on exciting network innovations. Dror drove the silicon and system architecture of multiple generations of NICs, switches, and SoCs. Dror’s main focus nowadays is on software architecture, enabling network acceleration of cool technologies like artificial intelligence, HPC, cloud, storage, big data, security, and more. He has organized more than a dozen hackathons and is passionate about innovation. Dror holds several patents in the field of high-speed networking. He graduated cum laude with a B.Sc. in Electrical Engineering and holds an MBA from the Technion -- Israel Institute of Technology.

Abstract

Tofu interconnect D is a new interconnect that will be used in the post-K system. TofuD is designed to build a highly scalable system with high-density nodes. TofuD inherits the communication functions of Tofu2 and enhances resilience with a new packet transfer technique called dynamic packet slicing. This presentation shows the design and the preliminary evaluation results of TofuD.


Bio

Yuichiro Ajima

Yuichiro Ajima is a Senior Architect at Fujitsu Limited, where he develops Fujitsu’s supercomputer systems, including the K computer. He has developed the Tofu interconnect series, and he was awarded the Imperial Invention Prize in 2014. He received the B.S. degree in Electrical Engineering, and the M.S. degree and Ph.D. in Information Engineering from the University of Tokyo in 1997, 1999, and 2002, respectively.

11:00 - 11:30

Break

Abstract

Many areas of science and engineering have adopted simulation-based research as a novel tool for discovery and insight. The sustained growth in supercomputer performance allows ever more detailed models, which nowadays makes supercomputing a viable tool for biology as well. However, the heterogeneity of biological systems challenges many aspects of scientific computing: intricate workflows are required for model generation, mathematical formulations are volatile, and memory requirements are demanding. At the same time, the weak-scaling properties of many biological systems are enormous and therefore a good match for today’s massive parallelism in supercomputers, whereas the multiple time scales inherent to biological systems require outside-the-box thinking. The EPFL Blue Brain Project has been pushing the boundaries of the size, complexity, and biological faithfulness of brain tissue simulations. This talk will give an overview of the computational and networking challenges, today’s solutions, and future paths for exploration.


Bio

Felix Schürmann

Felix Schürmann is adjunct professor at the Ecole Polytechnique Fédérale de Lausanne and co-director of the EPFL Blue Brain Project. He directs the computing research of the Blue Brain Project, overseeing more than 60 scientists and engineers in areas such as High Performance Computing, Scientific Visualization and Scientific Software Engineering. His research interests are how computing can help neuroscience and how neuroscience can help computing in the post-Moore era. He studied physics at the University of Heidelberg, Germany, supported by the German National Academic Foundation. Later, as a Fulbright Scholar, he obtained his Master's degree (M.S.) in Physics from the State University of New York, Buffalo, USA, under the supervision of Richard Gonsalves. During these studies, he became curious about the role of different computing substrates and dedicated his master thesis to the simulation of quantum computing. He studied for his Ph.D. at the University of Heidelberg, Germany, under the supervision of Karlheinz Meier. For his thesis he co-designed an efficient implementation of a neural network in hardware.

Abstract

The KISTI-5 system, Nurion, is one of the largest commodity cluster systems in the world, providing national supercomputing resources to users from academia and industry in Korea. The system consists of 8,305 Intel KNL-based nodes and 132 dual-socket Intel SKX nodes connected by a 100 Gbps Intel OPA interconnect, as well as 20 PB of storage and an 800 GB IME burst buffer. It reached 13.92 PFlops on the HPL benchmark and ranked #11 on the TOP500 list in June 2018, and it also achieved high ranks on other benchmark lists such as IO500, HPCG, and Graph500. In this talk, we will give an overview of the KISTI-5 system and its complexities and challenges from an operational perspective.


Bio

Tae Young Hong

Tae Young Hong is a Senior Researcher and Director of the Supercomputing Infrastructure Center, Division of National Supercomputing, Korea Institute of Science and Technology Information (KISTI). He leads the Supercomputing Infrastructure Center, whose main roles are acquiring, deploying, and operating national HPC systems in periodic time frames, as well as providing technical support for these national supercomputing resources to Korean researchers from academia and industry. Since joining KISTI in 2003, his research interests have included reliability engineering, the effective operation of large-scale HPC systems, and high-performance interconnects. He received his B.S. and M.S. in physics from Sungkyunkwan University, Korea.

Abstract

AI Bridging Cloud Infrastructure (ABCI) is the world's first large-scale Open AI Computing Infrastructure, constructed and operated by the National Institute of Advanced Industrial Science and Technology (AIST), Japan. It delivers 19.9 petaflops of HPL performance and the world's fastest training time of 74.7 seconds for ResNet-50 training on the ImageNet dataset. The underlying architecture of ABCI follows that of existing GPU-based supercomputers, such as TSUBAME2, Titan, and their successors. The Fujitsu-built system is powered by 2,176 Intel Xeon Gold Scalable processors, 4,352 NVIDIA Tesla V100 GPUs, and dual-rail InfiniBand EDR interconnects. Moreover, ABCI introduces various novel features and technologies, such as container-based software management, optimized network interconnects with scalable horizontal bandwidth, and a hierarchical storage I/O architecture including NVMe flash, a BeeOND-based on-demand parallel file system, a global parallel file system, and S3-compatible cloud object storage. We discuss the networking, communication, and I/O architecture of the ABCI system and how it is helping a range of applications.


Bio

Hirotaka Ogawa

Hirotaka Ogawa, Ph.D., is Director of the AIST-Tokyo Tech Real-World Bigdata Computation Open Innovation Laboratory (RWBC-OIL) and concurrently Leader of the AI Cloud Research Team, AI Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Japan. He also serves as Technical Lead of the AI Bridging Cloud Infrastructure (ABCI) and continues to supervise the design and implementation of ABCI as well as its operation. He received B.Eng. and M.Eng. degrees from the University of Tokyo and a Ph.D. in Science from the Tokyo Institute of Technology.

1:00 - 2:00

Lunch

Abstract

High Performance Computing has long been the domain of specialized supercomputers, designed to run tightly coupled parallel workloads. In the last 20 years, we have seen a migration from custom processors and memory systems to the near-ubiquity of more general-purpose processors and off-the-shelf memory systems. The network has become the differentiator between a scalable supercomputer and a loosely coupled cluster. Amazon Web Services recently introduced the Elastic Fabric Adapter (EFA), seeking to combine the properties necessary to support large-scale, tightly coupled HPC workloads with the flexible compute model that has made Cloud Computing so popular in enterprise workloads. In this talk, we will present an overview of the EFA design, an initial performance characterization, and the trade-offs that led to an HPC network designed with the cloud in mind.
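The abstract does not name the software interface, so as a hedged illustration only: EFA is exposed to HPC middleware such as MPI through the libfabric (OFI) API, and the short C sketch below simply enumerates the libfabric providers visible on a node and prints their names. The expected provider string ("efa") and the requested API version are assumptions for illustration; the program itself is generic libfabric and works with any provider.

```c
/*
 * Hedged sketch: enumerate libfabric (OFI) providers on a node and print
 * their names.  On an instance with EFA enabled one would expect an "efa"
 * provider to appear in the list (assumption for illustration).
 */
#include <stdio.h>
#include <rdma/fabric.h>

int main(void)
{
    struct fi_info *hints, *info, *cur;

    hints = fi_allocinfo();
    if (!hints)
        return 1;
    hints->ep_attr->type = FI_EP_RDM;   /* reliable datagram endpoints  */
    hints->caps          = FI_MSG;      /* basic send/receive messaging */

    if (fi_getinfo(FI_VERSION(1, 9), NULL, NULL, 0, hints, &info) == 0) {
        for (cur = info; cur; cur = cur->next)
            printf("provider: %s\n", cur->fabric_attr->prov_name);
        fi_freeinfo(info);
    }
    fi_freeinfo(hints);
    return 0;
}
```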


Bio

Brian Barrett

Brian Barrett is a Principal Engineer in the High Performance Computing group at Amazon Web Services, leading the Elastic Fabric Adapter project. He has also helped design components of the AWS Nitro system, a hardware-offload of common virtualized devices for AWS’s EC2 offering. Prior to joining AWS 5 years ago, Brian was a member of the Scalable Systems Software group at Sandia National Laboratories, where he worked on the Portals 4 network interface and the one-sided communication chapter of the MPI-3 specification. Brian is one of the founding members of the Open MPI project and received his PhD in Computer Science from Indiana University.

Abstract

Exascale supercomputing programs are occurring at a time when a small number of hyperscalers, like Microsoft Azure, are building massive digital footprints that span and connect across the planet. As such, how a Public Cloud company thinks about Exascale objectives, challenges, and opportunities depends on which dimension of "exascale" one is considering. This talk will cover several such "exascale" attributes in the context of the global Azure architecture, and what they suggest the future of a Public Cloud may be to support humanity’s largest computational workloads.


Bio

Evan Burness

Evan Burness is a Principal Engineer in the Specialized Computing division of Microsoft Azure. There, he drives strategic and architectural initiatives around high-performance computing (HPC) infrastructure offered on the Azure Cloud, with a focus on application parallelism, network and I/O scalability, and scientific workflows. Since joining Azure in 2017, he has launched two different fleets of HPC resources into Azure’s global network of datacenters, including the HB- and HC-series VMs that are the first on the Public Cloud to scale tightly coupled MPI workloads to 10,000 cores. Evan also leads Specialized Computing’s technology definition process with external vendors, developers, and platform partners.

Prior to joining Azure, Evan served as the Director of High-Performance Computing at Cycle Computing and, for eight years, as the Program Management lead for the Private Sector Program at the National Center for Supercomputing Applications (NCSA) at the University of Illinois, driving HPC cluster architecture for services to industry clients.

He holds an MS in Technology Management from the University of Illinois and a BA in Public Policy from Duke University.

Abstract

MPI will likely continue to be the dominant infrastructure used for parallel programming and network communication in high-performance computing for the next several years. However, as the commercial data center and cloud computing communities begin to re-discover the fundamentals of HPC networking, and as the traditional HPC workload expands to include more than just modeling and simulation applications executing on a space-shared batch-scheduled machine, MPI will need to move beyond its current capabilities. This talk will offer a perspective on aspects of future more heterogeneous systems relevant to MPI and describe capabilities that MPI must provide to meet the future needs of HPC systems and workloads.


Bio

Ron Brightwell

Ron Brightwell leads the Scalable System Software Department at Sandia National Laboratories. After joining Sandia in 1995, he was a key contributor to the high-performance interconnect software and lightweight operating system for the world’s first terascale system, the Intel ASCI Red machine. He was also part of the team responsible for the high-performance interconnect and lightweight operating system for the Cray Red Storm machine, which was the prototype for Cray’s successful XT product line. The impact of his interconnect research is visible in technologies available today from Atos/Bull, Intel, and Mellanox. He has also contributed to the development of the MPI-2 and MPI-3 specifications. He has authored more than 115 peer-reviewed journal, conference, and workshop publications. He is an Associate Editor for the IEEE Transactions on Parallel and Distributed Systems, has served on the technical program and organizing committees for numerous high-performance and parallel computing conferences, and is a Senior Member of the IEEE and the ACM.

Software and Hardware co-design for low-power HPC platforms

Manolis Ploumidis, Nikolaos Kallimanis, Marios Asiminakis, Nikolaos Chrysos, Vassilis Papaefstathiou, Pantelis Xirouchakis, Michalis Gianioudis, Nikolaos Dimou, Antonis Psistakis, Panagiotis Peristerakis, Manolis Katevenis
Foundation for Research and Technology -- Hellas (FORTH)
Abstract

As cluster computation power moves towards exascale, cost, in terms of both installation and operation, plays a significant role in future data centers and HPC clusters. To keep an HPC cluster economically viable, serious cost limitations on hardware and software deployment must be considered, which forces a rethinking of the design of modern HPC communication architectures.

In this paper we present a cross-layer communication architecture suitable for emerging HPC platforms based on heterogeneous multiprocessors. The proposed architecture introduces simple hardware blocks that allow integration in the same package as the processing unit. Apart from the hardware blocks, our communication architecture includes a user-space software stack that allows user-level access to these hardware blocks. The proposed architecture is able to provide efficient, low-latency, and reliable communication mechanisms to HPC applications. Moreover, we demonstrate an efficient implementation of the MPI standard that provides point-to-point and collective primitives that fully exploit the hardware capabilities. The proposed MPI implementation introduces low overheads, while the exploitation of the eager protocol gives us an end-to-end latency of 1.4 usec.
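The 1.4 usec figure above is attributed to the eager protocol. As a generic illustration only (not the authors' implementation), the C sketch below shows the usual eager vs. rendezvous split in a send path: small messages are pushed immediately with the payload inlined, while large messages first perform a ready-to-send handshake so the receiver can set up a zero-copy transfer. The threshold value and helper names are hypothetical.

```c
/*
 * Illustrative sketch (not the authors' code): eager vs. rendezvous split
 * in an MPI-like send path.  Small messages are sent immediately with the
 * payload inlined (no handshake, lowest latency); large messages start a
 * ready-to-send handshake so the bulk transfer can be zero-copy.
 */
#include <stdio.h>
#include <stddef.h>

#define EAGER_THRESHOLD 4096   /* bytes; threshold is assumed, not from the paper */

typedef struct { int dst; const void *buf; size_t len; } msg_t;

/* Stubs standing in for the hardware blocks described in the paper;
 * names and behavior are hypothetical. */
static void hw_send_inline(const msg_t *m)
{ printf("eager send of %zu bytes to rank %d\n", m->len, m->dst); }

static void hw_send_rts(const msg_t *m)
{ printf("rendezvous RTS for %zu bytes to rank %d\n", m->len, m->dst); }

static void send_path(const msg_t *m)
{
    if (m->len <= EAGER_THRESHOLD)
        hw_send_inline(m);   /* eager: one trip through the NIC */
    else
        hw_send_rts(m);      /* rendezvous: handshake, then bulk transfer */
}

int main(void)
{
    char small[64] = {0}, large[65536] = {0};
    msg_t a = { 1, small, sizeof small };
    msg_t b = { 1, large, sizeof large };
    send_path(&a);   /* takes the eager path      */
    send_path(&b);   /* takes the rendezvous path */
    return 0;
}
```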

4:00 - 4:30

Break

Panel Members

Holger Fröning, Heidelberg University, Germany.

Tobias Kenter, Paderborn Center for Parallel Computing (PC2), Paderborn University, Germany.

Iakovos Mavroidis, Institute of Computer Science (ICS), Foundation for Research and Technology -- Hellas (FORTH), Greece.

Javier Navaridas Palma, School of Computer Science, University of Manchester, UK.

Piero Vicini, National Institute for Nuclear Physics (INFN), Italy.