High Performance MPI on Infiniband Cluster
The MVAPICH2 software, supporting MPI 3.0 standard, delivers the best performance, scalability, and fault tolerance for high-end computing systems and servers using InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE networking technologies. This software is being used by more than 3,125 organizations world-wide in 89 countries to extract the potential of these emerging networking technologies for modern systems.
High-Performance Deep Learning and Machine Learning
The availability of large data sets (e.g. ImageNet, PASCAL VOC 2012) coupled with massively parallel processors in modern HPC systems (e.g. NVIDIA GPUs) have fueled a renewed interest in Deep Learning (DL) and Machine Learning (ML) models. In addition to the popularity of massively parallel DL/ML accelerators like GPUs, the availability and memory-abundance of modern CPUs poses a viable alternative for DL/ML training. This resurgence of DL/ML applications has triggered the development of DL frameworks like PyTorch, TensorFlow, LBANN, and Apache MXNet as well as ML frameworks like Scikit-Learn and cuML. While most DL/ML frameworks provide experimental support for multi-node training, their distributed implementation is often suboptimal. Further, the emergence of distributed DL frameworks such as Horovod and DeepSpeed introduce novel parallelism challenges.
High Performance Runtime for PGAS Models (OpenSHMEM, UPC and CAF)
Partitioned Global Address Space (PGAS) languages are growing in popularity because of their ability to provide shared memory programming model over distributed memory machines. In this model, data can be stored in global arrays and manipulated by individual compute threads. This model shows promise for expressing algorithms that have irregular computation and communication patterns. It is unlikely that such applications written in MPI will be re-written using the emerging PGAS languages in the near future. But, it is more likely that parts of these applications will be converted using newer models. This requires that underlying implementation of system software be able to support multiple programming models simultaneously.
Programming model support for GPU and Accelerators
General purpose Graphical Processing Units (GPUs) are becoming an integral part of modern system architectures. They are pushing the peak performance of the fastest supercomputers in the world and are speeding up a wide spectrum of applications. While the GPUs provide very high peak flops, data movement between host and GPU, and between GPUs continues to remain a bottleneck for both performance and programmer productivity. MPI has been the de-facto standard for parallel application development in the High Performance Computing domain. Many of the MPI applications are being ported to run on clusters with GPUs for higher performance. Our project aims to simplify this task by supporting standard Message Passing Interface (MPI) from GPU device memory through the MVAPICH2 MPI library. While supporting the advanced features of MPI like collective communication, user-defined datatypes and one-sided communication among others, MVAPICH2 aims to optimize the data movement between host and GPU, and between GPUs in the best way possible with minimal or no overhead to the application developer.
BigData (Hadoop,Spark & Memcached)
Apache Hadoop and Spark are gaining prominence in handling Big Data and analytics. Similarly, Memcached in Web 2.0 environment is becoming important for large-scale query processing. These middleware are traditionally written with sockets and do not deliver best performance on datacenters with modern high performance networks. In this tutorial, we will provide an in-depth overview of the architecture of Hadoop components (HDFS, MapReduce, RPC, HBase, etc.), Spark and Memcached. We will examine the challenges in re-designing the networking and I/O components of these middleware with modern interconnects, protocols (such as InfiniBand, iWARP, RoCE, and RSocket) with RDMA and storage architecture. Using the publicly available software packages in the High-Performance Big Data (HiBD, http://hibd.cse.ohio-state.edu) project, we will provide case studies of the new designs for several Hadoop/Spark/Memcached components and their associated benefits. Through these case studies, we will also examine the interplay between high performance interconnects, storage systems (HDD and SSD), and multi-core platforms to achieve the best solutions for these components.
Networking, Storage and Middleware Support for Cloud Computing and DataCenters
Cloud computing environments are paving new ways to provide infrastructure for enterprise environments and datacenters. Modern multi-core computing platforms together with high performance networking technologies (InfiniBand and 10/40GigE iWARP with RDMA and atomic operations) and storage technologies (SSDs) are providing new opportunities to design these environments with performance and scalability. Past research along this direction has focused on designing different RDMA-aware middleware components (such as active caching, distributed shared substrate, load-balancing, reconfiguration and admission control). Current research directions focus on scalable and high-performance designs for Memacached, HBase, HDFS and MapReduce under the Hadoop infrastructure.
High Performance Computing with Virtualization
Although originally focused on resource sharing, current virtual machine (VM) technologies provide a wide range of benefits such as ease of management, system security, performance isolation, checkpoint/restart and live migration. Cluster-based High Performance Computing can take advantage of these desirable features of virtual machines, which is especially important when ultra-scale clusters are posing additional challenges on performance, scalability, system management, and administration of these systems.
Scalable Filesystems and I/O
In modern clusters, I/O is quickly emerging as the main bottleneck limiting performance. The need for scalable parallel I/O and file systems is becoming more and more urgent. The performance of network storage systems is often limited by overhead in the I/O path, such as memory copying, network access costs, and protocol overhead. High speed networking architectures such as InfiniBand and Quadrics create opportunities to address these issues without changing fundamental principles of production operating systems.
Performance Evaluation of Cluster Networking and I/O Technologies
The aim of this project is to design and develop a comprehensive benchmark suite for carrying out performance evaluation of modern cluster networking and I/O technologies.
Fault Tolerance Backplane - InfiniBand
Modern High Performance Computing Systems have grown in size to thousands of nodes, having tens of thousands of processors and thousands of disks, which has resulted in an increase in the rate of failures. Fault Management in such systems has been traditionally handled by individual components at different levels in the stack, with no coordination among different components in the system. As a result, different components cannot benefit from the knowledge of the failures experienced by others. The CiFTS initiative aims to provide a coordinated infrastructure that will enable Fault Tolerance Systems to adapt to faults occurring in the operating environment in a holistic manner.
OpenSolaris NFS over RDMA Project
The aim of this project is to design and develop a high performance RDMA-enabled RPC transport for the NFS client and server in OpenSolaris. Existing transports based on TCP and UDP are limited by copying overhead, resulting in low aggregate bandwidth and high CPU utilization. In addition, the current RDMA-Read (Read-Read design) based design has performance and security limitations.
Efficient Shared Memory on High-Speed Interconnects
Software Distributed Shared Memory (DSM) systems do not perform well because of the combined effects of increase in communication, slow networks and the large overhead associated with processing the coherence protocol. Modern interconnects like Myrinet, Quadrics and InfiniBand offer reliable, low latency, and high bandwidth. These networks also support efficient memory based communication primitives like Remote Memory Direct Access (RDMA). These supports can be leveraged to reduce overhead in a software DSM system.
High Performance Networking for TCP-based Applications
The Transmission Control Protocol (TCP) is one of the universally accepted transport layer protocols in today's networking world. The introduction of gigabit speed networks a few years back had challenged the traditional TCP/IP implementation in two aspects, namely performance and CPU requirements. In order to allow TCP/IP based applications achieve the performance provided by these networks while demanding lesser CPU resources, researchers came up with solutions in two broad directions: user-level sockets and TCP Offload Engines. Both these approaches concentrate on optimizing the protocol stack either by replacing the TCP stack with zero-copy, OS-bypass protocols such as VIA, EMP or by offloading the entire or part of the TCP stack on to hardware. However, these offloading techniques have not been completely successful in avoiding the multiple copies and kernel context switches, mainly due to certain fundamental bottlenecks in the traditional TCP implementation including the memory traffic associated with the data streaming semantics, buffering for out-of-order packets, and several others.
Design and Evaluation of Advanced Message Queuing Protocol (AMQP) over InfiniBand
Message Oriented Middleware (MOM) plays a key role in enterprise data distribution. The strength of MOM is that it allows for communication between applications situated on heterogeneous operating systems and networks. MOM allows developers to by-pass the costly process of building explicit connections between these varied systems and networks. Advanced Message Queue Protocol (AMQP) has emerged as an open standard for MOM communication. AMQP grew out of the need for better messaging integration both within and across enterprise boundaries. InfiniBand promises to be a high performance network platform for AMQP communication.
NIC-level Support for Collective Communication and Synchronization with Myrinet and Quadrics
Collective operations such as barrier, broadcast, and reduction are very important in many parallel and distributed programs. For example, an efficient implementation of barrier is important because while processors are waiting on a barrier, generally, no computation can be performed, which impacts parallel speedup. The efficiency of barrier operations also affects the granularity of parallel computation. Some Network Interface Cards (NICs) such as Myrinet have programmable processors which can be customized to support collective communication.
NIC-level Support for Quality of Service (QoS)
Emerging applications targeted for clusters are inherently interactive and collaborative in nature. These applications demand end-to-end Quality of Service (QoS) in addition to performance. Achieving predictable performance and ability to exploit resource adaptivity are also common requirements of these next generation applications. Providing QoS mechanisms for clusters to satisfy the demands of next generation applications is a challenging task.
Design of Scalable Data-Centers over Emerging Technologies
Current data-centers lack in efficient support for intelligent services, such as requirements for caching documents and cooperation of caching servers, efficiently monitoring and managing the limited physical resources, load-balancing, controlling overload scenarios, that are becoming a common requirement today. On the other hand, the System Area Network (SAN) technology is making rapid advances during the recent years. Besides high performance, these modern interconnects are providing a range of novel features and their support in hardware (e.g., RDMA, atomic operations, IOAT, etc.). We propose a framework comprising of three layers (communication protocol support, data-center service primitives and advanced data-center services) that work together to tackle the issues associated with existing data-centers. The following figure shows the main components of our framework.