Many Big Data processing systems are gaining momentum in industry. Apache Hadoop and Spark have become standard tools for Big Data handling and analytics in IT companies. Similarly, Memcached is becoming important for large-scale query processing in Web 2.0 environments. Recent studies have shown that current-generation Hadoop, Spark, and Memcached cannot efficiently leverage the high-performance networking and storage architectures on modern HPC clusters, such as Remote Direct Memory Access (RDMA) enabled high-performance interconnects and heterogeneous, high-speed storage systems (e.g., HDD, SSD, NVMe-SSD, NVRAM, and Lustre). These systems are traditionally written with sockets and do not deliver the best performance on modern high-performance networks. In this BoF, we will organize several talks that give an in-depth overview of the architecture of popular Big Data processing system software (e.g., Hadoop, Spark, Flink, and Memcached). The speakers and the audience will work together to identify the most critical challenges the community currently faces in redesigning the internal components of this software for modern RDMA-capable interconnects and protocols (such as InfiniBand, iWARP, and RoCE), accelerators, and storage architectures. We will also solicit feedback from the community to develop a roadmap for the next 5–10 years on how to efficiently address the grand challenges of Big Data processing on modern HPC clusters.
Dhabaleswar K. (DK) Panda, Professor and Distinguished Scholar, The Ohio State University
Xiaoyi Lu, Research Scientist, Ohio State University
Richard Graham, Senior Solutions Architect, Mellanox Technologies
Francis Lam, Senior HPC Solution Architect, Huawei
Xiaoyi Lu, Research Scientist, Ohio State University
Yutong Lu, Director, System Software Laboratory, School of Computer Science, National University of Defense Technology
John Shalf, CTO for the National Energy Research Supercomputing Center & Department Head for Computer Science & Data Sciences, Lawrence Berkeley National Laboratory
This BoF is intended for people working in the areas of HPC and Big Data. The target audience includes:
- Scientists, engineers, researchers, and students engaged in designing next-generation Big Data system software and applications
- Designers and developers of high-performance Big Data system software, such as Hadoop, Spark, and Memcached
- Newcomers to the field of Big Data who are interested in familiarizing themselves with system software, RDMA, high-performance networking and storage, accelerators, etc.
- Managers and administrators responsible for setting up next-generation Big Data environments and high-end systems in their organizations or laboratories