Extreme Scale computing in HPC, Big Data, Deep Learning and Clouds is marked by multiple levels of hierarchy and heterogeneity, ranging from the compute units (many-core CPUs, GPUs, APUs etc) to storage devices (NVMe, NVMe over Fabrics etc) to the network interconnects (InfiniBand, High-Speed Ethernet, Omni-Path etc). Owing to the plethora of heterogeneous communication paths with different cost models expected to be present in extreme scale systems, data movement is seen as the soul of different challenges for exascale computing. On the other hand, advances in networking technologies such as NoCs (like NVLink), RDMA-enabled networks and the like are constantly pushing the envelope of research in the field of novel communication and computing architectures for extreme scale computing. The goal of this workshop is to bring together researchers and software/hardware designers from academia, industry and national laboratories who are involved in creating network-based computing solutions for extreme scale architectures. The objectives of this workshop are to share the experiences of the members of this community and to learn about the opportunities and challenges in the design trends for exascale communication architectures.

All times are in Eastern Daylight Time (EDT)

Workshop Program

8:00 - 9:00 AM

Keynote

Speaker: Satoshi Matsuoka, RIKEN Center for Computational Science, Japan

Title: Fugaku and its Advanced Network Features for Disaggregation

Abstract: Fugaku is currently the fastest supercomputer in the world, and its technical innovations and advancements are not only in the CPU itself but also in its interconnect. In fact, Fugaku incorporates not only the Tofu-D network interface and the associated DMAC, but also the 10-port switch that comprises the 6D Torus network. Since the network interface is directly connected to the intra-chip interconnect ring that also connects to memory and the L2 cache, the Tofu-D network allows composition of a disaggregated architecture, in that any memory in the system can be directly accessed from any CPU via RDMA, and the data can be injected into the L2 cache. Such features allow for very low latency communication for MPI, especially sub-microsecond one-sided communication, but we also expect that distributed shared memory features can be implemented efficiently on Fugaku, possibly matching the performance of hardware-based NUMA machines.

Speaker Bio: Satoshi Matsuoka has been the director of RIKEN CCS since April 2018. RIKEN CCS is the top-tier HPC center that represents HPC in Japan, developing and hosting Japan’s tier-one ‘Fugaku’ supercomputer, which has become the fastest supercomputer in the world in all four major supercomputer rankings, alongside a multitude of ongoing cutting-edge HPC research, including investigations into Post-Moore era computing.

He had been a Full Professor at the Global Scientific Information and Computing Center (GSIC) at the Tokyo Institute of Technology since 2000, and the director of the joint AIST-Tokyo Tech Real World Big Data Computing Open Innovation Laboratory (RWBC-OIL) since 2017, and became a Specially Appointed Professor at Tokyo Tech in 2018 along with his directorship at R-CCS.

He has been the leader of the TSUBAME series of supercomputers that have won many accolades such as world #1 in power-efficient computing. He also leads various major supercomputing research projects in areas such as parallel algorithms and programming, resilience, green computing, and convergence of big data/AI with HPC.

He has written over 500 articles according to Google Scholar, and has chaired numerous ACM/IEEE conferences, including serving as Program Chair of the ACM/IEEE Supercomputing Conference (SC13) in 2013. He is a Fellow of the ACM and European ISC, and has won many awards, including the JSPS Prize from the Japan Society for Promotion of Science in 2006, presented by His Highness Prince Akishino; the ACM Gordon Bell Prize in 2011; the Commendation for Science and Technology by the Minister of Education, Culture, Sports, Science and Technology in 2012; the 2014 IEEE-CS Sidney Fernbach Memorial Award, the highest prestige in the field of HPC; the HPDC 2018 Achievement Award from ACM; and, recently, the SC Asia 2019 HPC Leadership Award.

9:00 - 9:30 AM

Speaker: Gilad Shainer, NVIDIA/Mellanox

Title: Cloud Native Supercomputing 

PDF

Abstract: High performance computing and Artificial Intelligence are the most essential tools fueling the advancement of science. In order to handle the ever-growing demands for higher computation performance and the increase in the complexity of research problems, the world of scientific computing continues to reinvent itself at a fast pace. The session will review the recent development of the cloud native supercomputing architecture, which aims to bring together bare metal performance and cloud services.

Speaker Bio: Gilad Shainer serves as senior vice-president of marketing for Mellanox networking at NVIDIA, focusing on high-performance computing, artificial intelligence and the InfiniBand technology. Mr. Shainer joined Mellanox in 2001 as a design engineer and has served in senior marketing management roles since 2005. Mr. Shainer serves as the chairman of the HPC-AI Advisory Council organization, the president of the UCF and CCIX consortiums, a member of IBTA and a contributor to the PCI-SIG PCI-X and PCIe specifications. Mr. Shainer holds multiple patents in the field of high-speed networking. He is a recipient of the 2015 R&D100 award for his contribution to the CORE-Direct In-Network Computing technology and the 2019 R&D100 award for his contribution to the Unified Communication X (UCX) technology. Gilad Shainer holds an MSc degree and a BSc degree in Electrical Engineering from the Technion Institute of Technology in Israel.

9:30 - 10:00 AM

Speaker: Duncan Roweth, HPE

Title: Slingshot Network and its use in Exascale Systems

Abstract: This talk will provide an overview of the Slingshot Ethernet network fabric developed by Cray and being used by HPE in the US Exascale Systems. The talk will focus on aspects of the design that are important for exascale systems: scalability, support for cost effective network topologies, performance under load, and low-latency network collectives.

Speaker Bio: Duncan Roweth is a Distinguished Technologist in the HPC and Mission Critical Business Unit CTO office at HPE. He joined HPE in Jan 2020 with the acquisition of Cray. While at Cray he worked on three generations of HPC networks. He has been a leading figure in the Slingshot program since its inception. He is currently working on the design of 2nd and 3rd generation Slingshot products for future HPE systems. Duncan holds a Ph.D. from the University of Edinburgh.

10:00 - 10:30 AM

Speaker: Phil Murphy, Cornelis Networks

Title: Cornelis Networks Omni-Path: Purpose Built High-Performance Fabrics for HPC/HPDA/AI

PDF

Abstract: The convergence of traditional HPC modeling/simulation, HPDA and AI on a single compute cluster brings new challenges to the fabric design, but the required fundamental interconnect performance characteristics of low latency, extreme message rate, and scalable bandwidth remain paramount. Phil will discuss the fabric design trade-offs required to deliver these fundamentals. He will also discuss how Cornelis Networks is leveraging libfabric/OpenFabrics Interfaces to improve real application performance while taking advantage of industry-wide innovations in communications libraries and programming frameworks.

Speaker Bio: As CEO of Cornelis Networks, Phil is responsible for the overall management and strategic direction of the company. Prior to co-founding Cornelis Networks, Phil served as a director at Intel Corporation, responsible for fabric platform planning and architecture, product positioning, and business development support. Prior to that role, Phil served as vice president of engineering and vice president of HPC technology within QLogic’s Network Solutions Group, responsible for the design, development, and evangelizing of all high-performance computing products, as well as all storage area network switching products. Before joining QLogic, Phil was vice president of engineering at SilverStorm Technologies, which he co-founded in 2000 and which was acquired by QLogic in 2006. SilverStorm’s core focus was on providing complete network solutions for high performance computing clusters. Prior to co-founding SilverStorm, Phil served as director of engineering at Unisys Corporation and was responsible for all I/O development across the company’s diverse product lines.

Phil holds a BS in Mathematics from St. Joseph’s University and an MS in Computer and Information Science from the University of Pennsylvania.

10:30 - 11:00 AM

Speakers: Hemal Shah and Moshe Voloshin, Broadcom

Title: RoCEv2 Congestion Control Enhancements for Large Scale HPC and ML Deployments

PDF

Abstract: With the availability of 100 Gbps Ethernet and RoCEv2, Ethernet is replacing InfiniBand in High Performance Computing (HPC) and Machine Learning (ML) environments. There is a perception that RoCEv2 does not scale well and requires Priority-based Flow Control (PFC). The use of PFC alone can lead to head-of-line blocking that results in traffic interference. In this talk, we will show that RoCEv2 with ECN-based congestion control schemes scales well without requiring PFC. We will discuss congestion control enhancements, including switch buffer ECN thresholds, ECN marking/CNP generation in NICs, and hardware-based congestion control, that can further improve RoCEv2 performance in large scale deployments.

Hemal Shah Bio: Hemal Shah is a Distinguished Engineer and Systems/Software/Standards architect in the Data Center Solutions Group (DCSG) division at Broadcom Inc. He leads and manages a team of architects. Hemal is responsible for the definition of product architecture and software roadmap/architecture of all product lines of Ethernet NICs. Hemal led the architecture definition of several generations of NetXtreme E-Series/NetXtreme I server product lines and NetXtreme I client product lines. Hemal spearheaded the system architecture development of TruFlow technology for vSwitch acceleration/packet processing software frameworks, TruManage technology for system and network management, device security features, virtualization and stateless offloads. Hemal has defined the system architecture of RDMA hardware/software solutions for more than two decades.

Before joining Broadcom in 2005, Hemal worked at Intel Corporation where he led the development of system/silicon/software architecture of communication processors, 10 Gigabit Ethernet controllers, TCP/iSCSI/RDMA offloads, and IPsec/SSL/firewall/VPN accelerations. Hemal is the lead technical representative/contributor from Broadcom Inc. in the Open Compute Project (OCP) and Distributed Management Task Force (DMTF). Hemal serves as Senior VP of Technology in the DMTF and as a project co-lead of the OCP Hardware Management project. Hemal has co-authored several OCP specifications, 70+ DMTF specifications, four IETF RFCs, and 10+ technical conference/journal papers. Hemal is a named inventor on 40+ patents with several pending patents. Hemal holds Ph.D. (computer engineering) and M.S. (computer science) degrees from Purdue University, an M.S.E.E. degree from The University of Arizona, and a B.S. (electronics and communication engineering) degree from Gujarat University, India.

Moshe Voloshin Bio: Moshe Voloshin is a Systems Architect in the Data Center Solutions Group (DCSG) division at Broadcom Inc. Moshe spearheaded the system architecture development of RoCE and Congestion Control in Broadcom Ethernet NICs, and is involved in the definition of product architecture, modeling, and system simulations.

Previously, Moshe was a director, manager, and ASIC/HW engineer in Cisco’s high-end router division, where he developed and managed the development of Network Processing Unit (NPU), QoS, and fabric ASICs in products such as GSR and CRS.

11:00 - 11:30 AM

Speaker: Matthew Williams, Rockport Networks

Title: Upcoming ultra-low latency direct interconnect switchless RoCE adapter and associated software and programming models from Rockport Networks for HPC

PDF

Abstract: Poor workload performance due to network congestion is a well-understood challenge for HPC. Rockport Networks has developed an ultra-low latency direct interconnect switchless networking solution that delivers consistently low latency, even under heavy load from competing noisy-neighbor workloads. Rockport has partnered with OSU to develop new capabilities in their high performance MVAPICH library to take advantage of Rockport’s unique solution. Matthew will provide an overview of Rockport’s innovative distributed switching architecture, show best practices for benchmarking to predict performance in production environments, and share benchmark results illustrating Rockport’s excellent loaded latency characteristics.

Speaker Bio: Matthew Williams is CTO of Rockport Networks. He has 25 years of technical leadership and engineering experience, including 14 years as CTO of successful network technology companies, and holds 21 issued US patents. He is an expert strategist, analyst and visionary who has delivered on transformational product concepts. Matthew is an insightful and energetic communicator who enjoys product evangelization and inspiring global business and technical audiences.

Matthew has a B.Sc. in Electrical Engineering with First Class Honours from Queen’s University, Kingston, Canada and is a registered P.Eng.

11:30 AM - 12:00 PM

Speaker: Sanjay Basu, Oracle

Title: Architecting AI Services for Media industry using Oracle HPC Cloud

PDF

Abstract: In this presentation, I will dive deep into architectural patterns, options and constraints when designing the platform to deliver and process AI services for autonomous video transcoding, video classification and on-demand transcribing. I will touch upon the various network services, compute instances and storage options used to deliver these solutions. I will cover benchmarks between various NVIDIA GPUs (Pascal 100, Volta 100 and Ampere 100) and CPUs (Intel Skylake, Intel Icelake, AMD E3 and E4) for transcoding and ML training using an RDMA backend, and how to optimize the architecture for better performance, including various Oracle Cloud based Content Delivery Network options. Additionally, I will touch upon the cloud native open source scheduler and our event-driven, flow-based architecture for this family of applications.

Speaker Bio: Sanjay is currently serving as Director of Cloud Engineering at Oracle. His focus areas are Machine Learning on HPC/GPU, Data Science Platform and Blockchain-As-A-Service on Oracle’s 2nd Generation Bare Metal Cloud Computing Services. His most recent contribution is creating the set of "Validated Solution Guides" for "Architecting Deep Learning on Oracle Cloud IaaS for Autonomous Driving". Sanjay has 28 years of progressive experience in Information Technology related to Security, Cryptography and Infrastructure-as-a-Service. He has the distinction of being the lead architect for Dell Services’ first-ever Private Cloud, launched in 2008, and Chief Network Architect for Dell’s vCloud Public Offering in 2010. His past roles also include being field CTO for VCE/EMC’s Managed Cloud Services for Converged Systems and an executive consultant for AWS ProServe Enterprise Advisory Services. Additionally, Sanjay has served on advisory boards of companies focusing on Database Load-balancing and advanced training for virtualization and AI/ML. Sanjay is an alumnus of FinTech professionals from Oxford Said Business School. Sanjay has successfully completed advanced certificates on Design Thinking from Harvard, Masters in Systems Design and Advanced courses in AI/ML from MIT, and a Professional MBA from Boston University. Sanjay is also an alumnus of Amazon Machine Learning University. Additionally, he is an adjunct professor with Divergence Academy in Dallas. He has been awarded five patents so far. He is a regular speaker/presenter at Certified Information Security Conference, Quantum Optics and Computing, Annual Hamburg AIML Startup Conference, Oracle Open World, etc. His upcoming book is being published by Apress. He is a life member of ACM, SIAM and AAAI, and a Senior Member of IEEE and AMA.

Organizing Committee


Program Chairs

Program Committee