Extreme-scale computing in HPC, Big Data, Deep Learning and Clouds is marked by multiple levels of hierarchy and heterogeneity, ranging from the compute units (many-core CPUs, GPUs, APUs, etc.) to storage devices (NVMe, NVMe over Fabrics, etc.) to the network interconnects (InfiniBand, High-Speed Ethernet, Omni-Path, etc.). Owing to the plethora of heterogeneous communication paths with different cost models expected to be present in extreme-scale systems, data movement is seen as being at the heart of many of the challenges for exascale computing. On the other hand, advances in networking technologies such as NoCs (like NVLink), RDMA-enabled networks and the like are constantly pushing the envelope of research in the field of novel communication and computing architectures for extreme-scale computing. The goal of this workshop is to bring together researchers and software/hardware designers from academia, industry and national laboratories who are involved in creating network-based computing solutions for extreme-scale architectures. The objectives of this workshop are to share the experiences of the members of this community and to learn about the opportunities and challenges in the design trends for exascale communication architectures.
ExaComm welcomes original submissions in a range of areas, including but not limited to:
We outline the most significant challenges for building high-performance networks for exascale systems and discuss desirable network attributes for potential workloads. We cover recent network trends, and then examine several promising ideas and directions for addressing the exascale challenges and desirable attributes, particularly in the areas of technology, topologies, protocols, and support for offloaded remote transactions.
Craig Stunkel is a Principal Research Staff Member at IBM's T. J. Watson Research Center in Yorktown Heights, NY, and is currently Network Lead in the Data Centric Systems department that architected the CORAL systems being delivered to two U.S. national laboratories. He received the B.S. and M.S. degrees in electrical engineering from Oklahoma State University in 1982 and 1983, and the Ph.D. degree in electrical engineering from the University of Illinois, Urbana in 1990.
After joining IBM Research in 1990, he contributed to the network architecture and design of every generation of the IBM SP supercomputer line. He later served as Senior Manager of Deep Computing Software during the development of the IBM Blue Gene supercomputer line. He has received four IBM Outstanding Technical Achievement Awards for his contributions to IBM supercomputers. He holds 14 U.S. patents related to switching networks.
Dr. Stunkel is a Fellow of the IEEE.
Arm technologies present exciting possibilities for co-design of large-scale HPC systems. The U.S. Department of Energy (DOE) National Nuclear Security Administration is embarking on an effort to mature the Arm ecosystem for its Advanced Simulation and Computing (ASC) workloads, with a large-scale system deployment planned. This talk will present the key software stack requirements for this system and our plans for developing an efficient and productive ARMv8 programming environment. Focus areas include overall integration and robustness of the stack, network stack performance, addressing known challenges such as compilers and math libraries, and improving support for current HPC workflows and emerging integrated HPC/Data Analytic Computing (DAC) workloads. Working in collaboration with the Arm community and system vendors, our desire is to create an open and integrated software stack that is scalable to the largest supercomputers at DOE and elsewhere.
Kevin Pedretti is a Principal Member of Technical Staff at Sandia National Laboratories in the Center for Computing Research. His research is centered on scalable system software for extreme-scale parallel computing platforms, with a specific focus on lightweight operating systems, networking, and power management. He is leading an effort to mature the Arm software stack ecosystem for U.S. NNSA/ASC computing, with a large-scale Arm64 system deployment planned in 2018.
The latest revolution in HPC and AI is co-design, a collaborative effort among industry thought leaders, academia, and manufacturers to reach Exascale performance by taking a holistic system-level approach to fundamental performance improvements. Co-design recognizes that the CPU has reached the limits of its scalability, and offers an intelligent network as the new “co-processor” to share the responsibility for handling and accelerating application workloads. The session will describe the latest technology developments and performance results from the latest large-scale deployments.
Dror Goldenberg has served as Mellanox’s vice president of software architecture since October 2013. Previously, Mr. Goldenberg served as vice president of architecture from March 2010 to October 2013, where he was responsible for software, firmware and system architecture. Mr. Goldenberg joined Mellanox in 2000 as an architect, serving in numerous architecture and design positions. Prior to Mellanox, Mr. Goldenberg held various software development positions on an elite R&D team in the Israeli Defense Force. Prior to that, Mr. Goldenberg was a member of the Pentium MMX and Pentium 4 architecture validation teams at Intel. He contributed to the PCI and InfiniBand specifications and holds several patents in the field of high speed networking. Mr. Goldenberg graduated Cum Laude with a B.Sc. in Electrical Engineering and holds an MBA from the Technion Institute of Technology Israel.
Artificial Intelligence and High Performance Data Analytics applications in the Cloud are being fed by a deluge of data emanating from the Internet-connected population and devices. Workloads in the cloud are beginning to demand high performance from their data center fabrics to handle “east-west” communication.
This talk highlights the broadening role of fabrics in general, and of the Open Fabrics Interface in particular, in rising to the challenge of meeting the emerging semantic requirements of applications on a converged platform.
Sayantan Sur is a Software Engineer at Intel Corp. in Hillsboro, Oregon. His work involves high-performance computing, specializing in scalable interconnection fabrics and message-passing software (MPI). Before joining Intel Corp., Dr. Sur was a Research Scientist at the Department of Computer Science and Engineering at The Ohio State University. In the past, he has held a post-doctoral position at IBM T. J. Watson Research Center, NY. He has published more than 20 papers in major conferences and journals related to these research areas. Dr. Sur received his Ph.D. degree from The Ohio State University in 2007.
ABCI, constructed and operated by the National Institute of Advanced Industrial Science and Technology (AIST) in Japan, is an open innovation platform that offers world-class computing resources, 0.55 exaflops for AI and 37 petaflops for HPC, for AI R&D through industry and academia collaboration. The basic architecture of ABCI is quite similar to that of modern supercomputing systems, but ABCI introduces various novel features and technologies to support scalable AI/Big Data, such as container-based software management, optimized network interconnects with scalable horizontal bandwidth, and deep hierarchical memory/storage including flash SSDs, parallel file systems, and object-based campaign storage. Drawing on our experiences, we discuss the current status, issues and problems in the convergence of HPC and AI/Big Data.
Hitoshi Sato is a senior research scientist at the Artificial Intelligence Research Center (AIRC) of the National Institute of Advanced Industrial Science and Technology (AIST). He joined AIRC in 2016 and has been involved with several AI/HPC-related projects such as ABCI (AI Bridging Cloud Infrastructure). His research interests include high-performance data-intensive computing for future extreme-scale supercomputers and cloud data centers.
This presentation analyses the essence of DataFlow SuperComputing, defines its advantages and sheds light on the related programming model. DataFlow computers, compared to ControlFlow computers, offer: (a) speedups of 20 to 200 times (depending on the algorithmic characteristics of the most essential loops and the spatial/temporal characteristics of the Big Data stream, etc.), (b) potential for better precision (depending on the characteristics of the optimizing compiler and the operating system, etc.), (c) power reductions by a factor of about 20 (depending on the clock speed and the internal architecture, etc.), and (d) size reductions also by a factor of about 20 (depending on the chip implementation and the packaging technology, etc.). However, the programming paradigm is different, and has to be mastered.
The talk explains the paradigm, using Maxeler as an example, and sheds light on the ongoing research, which, in the case of the speaker, was highly influenced by four different Nobel Laureates: (a) from Richard Feynman it was learned that future computing paradigms will be successful only if the amount of data communications is minimized; (b) from Ilya Prigogine it was learned that the entropy of a computing system would be minimized if spatial and temporal data get decoupled; (c) from Daniel Kahneman it was learned that the system software should offer options related to approximate computing; and (d) from Andre Geim it was learned that the system software should be able to trade between latency and precision. The presentation concludes with the latest achievements of Maxeler Technologies in the current year.
Prof. Veljko Milutinovic (1951) received his PhD from the University of Belgrade, spent about a decade in various faculty positions in the USA (mostly at Purdue University and Indiana University in Bloomington), and was a co-designer of DARPA's first GaAs RISC microprocessor and DARPA's first GaAs Systolic Array (both well documented in the open literature). Later, for about two decades, he taught and conducted research at the University of Belgrade, in EE, MATH, BA, and PHYS/CHEM. He now serves as the Chairman of the Board for the Maxeler operation in Belgrade, Serbia and the Chairman of the Board of MECOnet.me in Podgorica, Montenegro.
His research is mostly in data mining algorithms and dataflow computing, with an emphasis on mapping data analytics algorithms onto fast, energy-efficient architectures. For 8 of his books, forewords were written by 8 different Nobel Laureates with whom he cooperated on his past industry-sponsored projects. He has over 40 IEEE journal papers, over 40 papers in other SCI journals (4 in ACM journals), over 400 Thomson-Reuters citations, and about 4000 Google Scholar citations. He has delivered short courses on the subject at a number of universities worldwide: MIT, Harvard, Boston, NEU, USC, UCLA, Columbia, NYU, Princeton, NJIT, CMU, Temple, Purdue, IU, UIUC, Michigan, FAU, FIU, EPFL, ETH, TUWIEN, UNIWIE, Karlsruhe, Heidelberg, Stuttgart, Aachen, Napoli, Salerno, Siena, Pisa, etc. He has also delivered them at the World Bank in Washington DC, State Street, Brookhaven National Laboratory, Lawrence Livermore National Laboratory, IBM TJ Watson, HP Encore Lab, Intel Oregon, Qualcomm VP, Yahoo NY, Google CA, ABB Zurich, Oracle Zurich, etc.
This talk will first cover the technical limitations driving towards the end of copper cables for HPC interconnects. While AOCs (Active Optical Cables) have been a "solution" for some time now, they carry a prohibitive cost structure that limits the enablement of rich and capable interconnects. Co-packaged optics, which are just about to become mainstream, will enable passive optical cables and an entirely new range of possibilities for alternate topologies and multiple fabric planes.
Dr. Nicolas Dubé is the Chief Strategist for High-Performance Computing at Hewlett Packard Enterprise. He is the technical lead of the Advanced Development Team, a “skunkworks” group dedicated to the advancement of core technologies and the redefinition of system architecture for next-generation supercomputers. He is also the Chief Architect for Exascale, driving a deliberately more open and balanced vision. Previously, he led the Apollo 8000 team as the system architect for NREL’s Peregrine, which was awarded the R&D Awards 2014 “most significant innovation of the year”. Drawing on combined experience in server and datacenter engineering, Nicolas advocates for a “greener” IT industry that leverages warm water cooling, heat re-use and carbon-neutral energy sources while pushing for dramatically more efficient computing systems.
Dr. Luiz DeRose is a Senior Principal Engineer and the Programming Environments Director at Cray Inc., where he is responsible for the programming environment strategy for all Cray systems. Before joining Cray in 2004, he was a research staff member and the Tools Group Leader at the Advanced Computing Technology Center at IBM Research. Dr. DeRose has a Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign. With more than 25 years of high performance computing experience and a deep knowledge of its programming environments, he has published more than 50 peer-reviewed articles in scientific journals, conferences, and book chapters, primarily on the topics of compilers and tools for high performance computing.
Architecture simulation can aid in predicting and understanding application performance, particularly for proposed hardware or large system designs that do not yet exist. In network design studies for high-performance computing, most simulators focus on the dominant message passing (MPI) model. Currently, many simulators build and maintain their own simulator-specific implementations of MPI. This approach has several drawbacks. Rather than reusing an existing MPI library, simulator developers must implement all semantics, collectives, and protocols. Additionally, alternative runtimes like GASNet cannot be simulated without again building a simulator-specific version. It would be far more sustainable and flexible to maintain lower-level layers like uGNI or IB-verbs and reuse the production runtime code. Directly building and running production communication runtimes inside a simulator poses technical challenges, however. We discuss these challenges and show how they are overcome via the macroscale components for the Structural Simulation Toolkit (SST), leveraging a basic source-to-source tool to automatically adapt production code for simulation. SST is able to encapsulate and virtualize thousands of MPI ranks in a single simulator process, providing a “supercomputer in a laptop” environment. We demonstrate the approach for the production GASNet runtime over uGNI running inside SST. We then discuss the capabilities enabled, including investigating performance with tunable delays, deterministic debugging of race conditions, and distributed debugging with serial debuggers.
This article introduces ten different tensor operations, their generalizations, as well as their implementations for a dataflow paradigm. Tensor operations could be utilized for addressing a number of big data problems in machine learning and computer vision, such as speech recognition, visual object recognition, data mining, deep learning, genomics, mind genomics, and applications in civil and geo engineering. As big data applications are breaking the Exascale barrier, and also the Bronto scale barrier in the not-so-distant future, the main challenge is finding a way to process such big quantities of data. This article sheds light on various dataflow implementations of tensor operations, mostly those used in machine learning. The iterative nature of tensor operations and the large amount of data make them suitable for the dataflow paradigm. All the dataflow implementations are analyzed comparatively with the related control-flow implementations, for speedup, complexity, power savings, and MTBF. The core contribution of this paper is a table that compares the two paradigms for various data set sizes, and in various conditions of interest. The results presented in this paper are made to be applicable both for the current dataflow paradigm implementations and for what we believe are the optimal future dataflow paradigm implementations, which we refer to as the Ultimate dataflow. This portability was made possible because the programming model of the current dataflow implementation is applicable also to the Ultimate dataflow. The major differences between the Ultimate dataflow and the current dataflow implementations are not in the programming model, but in the hardware structure and in the capabilities of the optimizing compiler.
In order to show the differences between the Ultimate dataflow and the current dataflow implementations, and in order to show what to expect from the future dataflow paradigm implementations, this paper starts with an overview of Ultimate dataflow and its potentials.
Luiz DeRose, Senior Principal Engineer and Programming Environments Director, Cray.
Torsten Hoefler, Associate Professor, ETH Zürich.
Bernd Mohr, Institute for Advanced Simulation (IAS), Jülich Supercomputing Centre (JSC).
Sameer Shende, Director of the Performance Research Lab, NIC, University of Oregon.
Anthony Skjellum, Professor, The University of Tennessee at Chattanooga.